arXiv: 1507.01122v3 [cs.AI] 13 May 2016 


Modeling the Mind: A brief review 

Gabriel Makdah 

gabriel.makdah@etu.univ-lyon1.fr

Université Claude Bernard Lyon 1, France

I - Computational Foundations




Contents 


1 Introduction

2 The foundations of computers
2.1 Hardware
2.2 Computing

3 Representations
3.1 Dimensionality
3.2 High-dimensional representations
3.3 High-dimensional memory

4 The brain and diversity
4.1 Mixed selectivity
4.2 Robustness
4.3 Randomness

5 Computationalism, and Connectionism
5.1 Computationalism
5.2 Connectionism
5.3 The necessity of variable binding

6 Artificial Neural Networks
6.1 Parallel Distributed Representations
6.2 Restricted Boltzmann Machines
6.3 Recurrent Neural Networks
6.4 Using an external stack memory
6.5 Long Short-Term Memory
6.6 Long Term memory

7 Vector Symbolic Architectures
7.1 Vector components
7.2 Vector operations
7.3 Holographic Reduced Representations

8 Working memory
8.1 Biological architecture
8.2 Memory functions
8.3 Memory operations in the pre-frontal cortex and basal ganglia
8.4 Modeling the PFC-BG system

9 Modeling the cortical architecture
9.1 Multiple layers of representation
9.2 Training unlabeled data
9.3 Parallel constraint satisfaction processes
9.4 Part-whole hierarchies
9.5 Hierarchical Temporal Memory

10 Constructing the Code: A recap

















Preface 


Modeling the Mind: A brief review is an annual review, available for free on arXiv.org,
whose aim is to help students and researchers unfamiliar with the fields of neuroscience and
computational neuroscience gain insight into the fundamentals of this domain of study.
Creating an accurate simulation of the mind is no easy task, and while it took brilliant
minds decades to advance us to where we are right now, we are still a long way from our
final goal. It is therefore imperative to have more research carried out in this
multidisciplinary field, drawing on researchers in biology, neuroscience, and computer
science, but also mathematics, physics, chemistry and imaging, in order to speed up this
process and tip the scales in our favor in the upcoming decades. This annual review hopes to
provide the required information for anyone who is considering this domain as a future endeavor.

The reviews will tackle relatively global characteristics at first in order to familiarize
the reader with the basic foundations, and will get progressively more specific and in
tune with current research in the upcoming parts. This is Part I. It contains basic
information about the computational aspect of this field, and attempts to explain why
certain concepts are generally agreed upon, and the intuition behind them, going through
the essential founding works.





Abstract 


The brain is a powerful tool used to achieve amazing feats. There have been several
significant advances in neuroscience and artificial brain research in the past two decades.
This article is a review of such advances, ranging from the concepts of connectionism, to
neural network architectures and high-dimensional representations. There have also been
advances in biologically inspired cognitive architectures, of which we will cite a few. We
will be positioning relatively specific models in much broader perspectives, while also
comparing and contrasting their advantages and weaknesses. The projects presented are
targeted to model the brain at different levels, utilizing different methodologies.

1 Introduction 

Computational neuroscience has now existed for several decades. It started off as an attempt
to model individual neurons’ firing mechanics, with the Hodgkin-Huxley model[30] in 1952,
as well as neurons’ learning models, with Hebbian learning[24] in 1949. Today, scientists
use computer simulations not only to model some of the brain’s traits, but also to predict
some of its behavior, and study new and emerging aspects of it. The results produced using
simulations often lead to testable predictions in the real world.

Neural networks were created by researchers who wanted to model architectures that
were similar to that of the brain. From constructing nodes, to learning algorithms such as
backpropagation, the brain has greatly influenced neural network development. Ever since
their conception, neural networks have exhibited features that were remarkable for computers.
To this day, we are still surprised by their expanding catalog of capabilities. They are used
in tasks such as speech recognition, image recognition, learning grammars and languages,
robotics, and a wide array of subjects that require a human-like thought process to conduct.
All of this makes artificial neural networks the prime contenders for modeling the mind.

In the upcoming sections, we will discuss some of the theories behind neural networks,
their foundations, the most widely used algorithms, as well as some theories about their
application to the field of neuroscience. Section 3 will be dedicated to examining computer
patterns and representations, in order to take maximal advantage of the computable space.
The brain, and its whole spectrum of variety, will be presented in section 4 in order to have a
better understanding of it before progressing any further. The concepts of computationalism
and connectionism, two theories for modeling the brain, will be discussed in section 5. In
section 6, we will present artificial neural networks, and talk about the most widely used
architectures and algorithms, as well as the advantages of each one of them. Section 7 will
go over vector symbolic architectures and some basic operations and models to optimize
storage. Section 8 will serve as an overview of the brain’s working memory, as well as the
modeling of the pre-frontal cortex’s memory system, a key component of working memory.
In section 9, we will talk about the main problems that artificial brain architectures should
be tackling, and give examples of how research has been aiming to solve them. The final
section, section 10, will serve as a conclusion to the article. We will discuss what
an optimal model of the brain would look like following the concepts that were discussed in
this review.

But first, let us give an overview of what made digital computers possible, and talk 
briefly about some of the basic principles behind them in section 2. 

2 The foundations of computers 

2.1 Hardware 

The grounds of today’s digital computers lie in the Von Neumann architecture. 

The Von Neumann[84] architecture is a computer model described 60 years ago, 
whose success has made computers a ubiquitous part of our lives. The architecture depicts 
an electronic digital computer with different parts. The first, and arguably the most important 
of these, is the processing unit, which contains an Arithmetic Logic Unit, ALU for short, 
as well as processor registers. The control unit houses an instruction register and a program
counter. It helps direct operations and instructions to the processor. Memory is used to
store data and instructions, either in rapidly accessible Random-Access Memory, RAM for
short, or in slower, but higher-capacity, external mass storage, such as hard disk drives and,
more recently, solid state drives. The machine also needs input hardware in order to take
queries from the operator; equipment such as keyboards, mice, or pressure sensors is used.
In order to display the results of its computation, the machine expresses itself through output 
hardware, such as a monitor, speakers, or a robotic arm. 

Data, and the instructions that manipulate data, are entities of the same kind. They are
stored in the same memory and share the same set of address and data buses. The Von Neumann
architecture has only one memory space.

Many attempts have been made to program computers for the kind of flexible intelligence
that characterizes human behavior, however with no remarkable success. This has led many
to wonder whether a different computing architecture is needed, and whether the problem
stems from a lack of software optimization, or from an inherently limiting bottleneck in
the Von Neumann architecture. Several new architectures were presented throughout the
years, each tackling different issues. Some are in wide use today, which shows that
the Von Neumann model is not the be-all and end-all of architectures.

An example of such models would be the Modified Harvard architecture. This model
has 2 separate memory spaces, compared to only one in the Von Neumann architecture. The
first memory space is used to read data from, and write data to, the memory. The second
set of address and data buses is used to fetch instructions. This results in an increase in
throughput, seeing as instruction fetches and data accesses no longer compete for a single
memory path. While the system is executing one instruction, it can be fetching the next one
at the same time.





In our case, the burden falls on representation, the kinds of entities that the computer uses in
order to process its computations. If our goal is to model the mind, then we ought to be
using representations similar to those used by the brain. Instead of using low-dimensional
binary vectors as conventional computers would do, the basic idea here is to compute with
large vector patterns, called high-dimensional random vectors. We will be tackling this
subject next, but first, let us give a general definition of computing.

2.2 Computing 

Computing is the transformation of patterns by algorithms which are governed by rules. 
Patterns, also known as representations, are the configuration of ON and OFF switches in 
a sequence. In other words, they are the sequencing of bits. An algorithm helps compute 
a new pattern based on the previous strings of patterns. Patterns can be meaningful when 
they correspond to either abstract entities, or things in the world, such as numbers, words, 
images. 

To transform patterns, computers have circuits, such as the adder circuit for summing
two numbers, the OR circuit, the AND circuit, and so on, whose interpretations are the same
as their uses in ordinary language. The materials that computers and circuits are made of are
subordinate. That is to say, they are not a major contributor to making the logic possible.
Materials can be changed as the cost of production and development varies, leading to better
constituents. The logical design can be separated from the materials and the hardware. This
separation can be seen not only in man-made computers, but in humans themselves.
Although we are all capable of comparable computing power, our underlying architectures
are very different. No two people have the same neural hardware, yet the logical design still
holds from one person to the other.
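As a minimal sketch of how such circuits transform bit patterns, the following Python snippet builds an adder out of AND, OR and XOR operations. The function names (`full_adder`, `add_bits`) and the least-significant-bit-first convention are assumptions made for illustration; they do not come from the works reviewed here.

```python
def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def XOR(a, b):
    return a ^ b

def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out

def add_bits(x_bits, y_bits):
    """Ripple-carry addition of two equal-length bit lists (least significant bit first)."""
    carry = 0
    result = []
    for a, b in zip(x_bits, y_bits):
        sum_bit, carry = full_adder(a, b, carry)
        result.append(sum_bit)
    result.append(carry)  # the final carry becomes the most significant bit
    return result

# 5 + 3, written least significant bit first: 5 = [1, 0, 1], 3 = [1, 1, 0]
print(add_bits([1, 0, 1], [1, 1, 0]))  # [0, 0, 0, 1], i.e. 8
```

Nothing in the sketch depends on what the gates are physically made of, which is the point made above about separating logical design from materials.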


3 Representations 

Computers use patterns formed by strings of 0s and 1s. These are called binary vectors,
and are used in binary representations. The goal of a certain representation is to be able
to discriminate between any two objects. In other words, the bit patterns representing two
different values must differ from one another.

The meanings which patterns hold are not apparent in their standalone binary vectors.
Instead, it’s the relations between different patterns that determine their possible use for
computing. This is also true for the brain. A standalone neuron being activated does not
yield much, if any, work done. It’s the complete structure of inter-relational patterns that
generates computation.

Choosing the correct representation is subject to trade-offs and compromises. There
are no perfect representations that would be suitable for all tasks. The base-2, binary, system
is extremely efficient in arithmetic operations and is therefore used in computers. Our brains
very probably use a different kind of representation, with different trade-offs, better suited
for vital functions, compromising the speed and precision of arithmetic tasks.

However, brain-like abilities and functions can still be demonstrated in computer
simulations using nothing but binary patterns and operations. This shows that, up to a
certain point, the precise nature of the individual units doesn’t matter as much as the
properties deriving from them through the specified operations.

Representations can be done in a variety of ways, depending on the type of tasks, and
the computer architecture that is at hand. Scalars are one-dimensional vectors. They are
represented using higher-dimensional vectors in computers: the scalar 46260 is represented
using a 16-bit integer, which is a 16-dimensional binary vector. In the next sections, we will
talk briefly about the basic features underlying the brain’s representation models, and then
generalize to what our model should incorporate.
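To make the scalar example concrete, here is a short sketch that unpacks the integer 46260 into its 16-dimensional binary vector and packs it back. The helper names `to_bit_vector` and `from_bit_vector` are hypothetical, and an unsigned 16-bit encoding with the most significant bit first is assumed.

```python
def to_bit_vector(value, width=16):
    """Return the unsigned binary representation of value as a list of bits (MSB first)."""
    if not 0 <= value < 2 ** width:
        raise ValueError("value does not fit in the requested width")
    return [(value >> i) & 1 for i in reversed(range(width))]

def from_bit_vector(bits):
    """Inverse of to_bit_vector: rebuild the scalar from its bits (MSB first)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

bits = to_bit_vector(46260)
print(bits)                   # [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
print(from_bit_vector(bits))  # 46260
```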

3.1 Dimensionality 

A huge number of neurons and synapses form the brain’s circuitry. Each neuron is connected
to tens of thousands of other neurons. There are hundreds of thousands of neurons that fire
each second. The amount of information that is being exchanged inside our brains, between
neurons, and between regions, is extremely large. To match such high content and rate of
exchange, we will shy away from scalars and low-dimensional vectors. The information in
the brain is encoded on larger elements than these. This leads us to explore the possibility of
high-dimensional vector use, vectors that are thousands of bits long[37].

A high-dimensional vector is represented in a high-dimensional space, which groups
all the vectors of one representation together. Just like a 2-dimensional vector is represented
in a 2-dimensional space, an N-dimensional vector is represented in an N-dimensional space.
High-dimensional modeling started off several decades back under the forms of artificial
neural networks, parallel distributed processing and connectionism. We will be exploring all
these areas later in this review. First however, let us expand a bit on these high-dimensional
representations.

3.2 High-dimensional representations 

The space of representations makes up the set of units with which a computer computes.
This space is low-dimensional in conventional computers, where the memory is usually
addressed in 8-bit bytes, and the ALU applies its arithmetic operations on 32-bit vectors.
The high-dimensional representational vectors and space aren’t necessarily binary. They
can be ternary, real, or complex. They can also be further modulated using probabilistic
distributions, limited ranges of values, and sparse modeling. Sparse modeling corresponds
to loosely coupled systems. Many components are either zero, or negligibly small. This
saves a large amount of memory, and speeds up data processing, while maintaining the
advantages of high-dimensionality[35]. Before advancing any further, let us take an example
of a low-dimensional vector.

Let us consider a 3-bit string:

(1, 0, 1)

This string is also considered to be a 3-dimensional vector.

It can be represented in a 3-dimensional space.

Such a three-dimensional binary space can contain $2^3 = 8$ patterns.

To showcase the difference in orders of magnitude between a low, and high-dimensional 
representation, let us now take an example of a high-dimensional binary vector. 

Let us consider a 20,000-bit string:

(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, ...)

This string is also considered to be a 20,000-dimensional vector.

It can be represented in a 20,000-dimensional space.

A binary 20,000-dimensional space consists of:

$2^{20,000} \approx 3.98 \times 10^{6020}$ such patterns.
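The two pattern counts can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name `log10_pattern_count` is an assumption, and the count is handled through its base-10 logarithm because the full number is thousands of digits long.

```python
import math

def log10_pattern_count(n_bits):
    """Base-10 logarithm of the number of distinct binary vectors with n_bits components."""
    return n_bits * math.log10(2)

# Low-dimensional case: 3 bits give 2^3 = 8 patterns.
print(2 ** 3)

# High-dimensional case: split log10(2^20000) into an exponent and a mantissa.
log_count = log10_pattern_count(20_000)
exponent = math.floor(log_count)
mantissa = 10 ** (log_count - exponent)
print(f"2^20000 is about {mantissa:.2f} x 10^{exponent}")
```

Going from 3 to 20,000 dimensions multiplies the exponent of the count, not just the count itself, which is what makes the high-dimensional space so sparsely occupied.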

Because of the discrete nature of vector representations, all vectors are equally spaced. 
Distances can be measured between points using the Euclidean metric, or the Hamming 
distance. The Hamming distance is the number of places at which two binary vectors differ. 
Between two 20,000-dimensional vectors, the Hamming distance cannot exceed 20,000 bits. 
This can be normalized to 1 by dividing the Hamming distance by the dimensionality
of the vector. Because of the nature of the representational space, the large majority of
distances between two vectors are concentrated around 0.5 (or 10,000 bits in our example). In
statistical terms, picking two random vectors, there is over a 99.99% chance that the distance
separating them is between 0.47 and 0.53. If we take a third vector, it will also differ from
the first two by around 10,000 bits. These vectors are considered unrelated, and are said
to have a dot product between them of mean zero.
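The concentration of distances around 0.5 is easy to observe empirically. The sketch below draws three random 20,000-bit vectors, stored as Python integers, and prints their pairwise normalized Hamming distances. The seed and the helper names are arbitrary choices made for illustration.

```python
import random

DIM = 20_000

def random_vector(rng, dim=DIM):
    """Draw a random binary vector, stored as a dim-bit Python integer."""
    return rng.getrandbits(dim)

def hamming(x, y):
    """Number of bit positions at which x and y differ."""
    return bin(x ^ y).count("1")

def normalized_hamming(x, y, dim=DIM):
    return hamming(x, y) / dim

rng = random.Random(0)  # fixed seed so the run is repeatable
a, b, c = (random_vector(rng) for _ in range(3))

distances = {
    "a-b": normalized_hamming(a, b),
    "a-c": normalized_hamming(a, c),
    "b-c": normalized_hamming(b, c),
}
for pair, distance in distances.items():
    print(pair, round(distance, 4))
```

For independent random bits the distance follows a binomial distribution whose standard deviation is sqrt(20,000)/2, about 71 bits, or roughly 0.0035 once normalized. The interval from 0.47 to 0.53 therefore spans more than eight standard deviations on each side of 0.5.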

Because of these characteristics, if we were to take a large number of vectors, such
as $10^6$ vectors that are 20,000-dimensional, as an example, this will still be a very small
percentage of the total number of vectors in the vector space. Noisy vectors would still be
identifiable among the large number of chosen vectors, as long as the chosen vectors are all
randomly spaced from one another. Related, or similar, concepts can be coded in vectors
that have smaller Hamming distances separating them. This way, if the noise were too great,
and we were unable to recover the original vector, we would at the very
least obtain a closely matching, or similar, one.
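A small experiment illustrates why noisy vectors remain identifiable. The sketch below stores a set of random vectors, corrupts one of them by flipping 30% of its bits, and recovers it by nearest-neighbor search in Hamming distance. The sizes (1,000 stored vectors of 10,000 bits) are assumptions chosen to keep the run fast, and are smaller than the figures used in the text.

```python
import random

DIM = 10_000      # assumed dimensionality, reduced for speed
N_STORED = 1_000  # assumed number of stored vectors

def hamming(x, y):
    """Number of differing bits between two vectors stored as integers."""
    return bin(x ^ y).count("1")

def add_noise(vector, n_flips, rng, dim=DIM):
    """Flip n_flips distinct, randomly chosen bits of vector."""
    for position in rng.sample(range(dim), n_flips):
        vector ^= 1 << position
    return vector

def nearest(query, stored):
    """Index of the stored vector closest to query in Hamming distance."""
    return min(range(len(stored)), key=lambda i: hamming(query, stored[i]))

rng = random.Random(1)
stored = [rng.getrandbits(DIM) for _ in range(N_STORED)]

noisy = add_noise(stored[42], 3_000, rng)  # 30% of the bits are wrong
print(hamming(noisy, stored[42]))          # 3000
print(nearest(noisy, stored))              # 42
```

The corrupted vector sits 3,000 bits from its original but about 5,000 bits from every unrelated vector, so the nearest stored vector is still the right one.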





A cognitive system such as the brain includes several representational spaces, each of
different dimensions, which are used for different tasks. As we will see later in this review,
some neurons exhibit the characteristics of low-dimensional vectors, while others demonstrate
properties of high-dimensional representations.

3.3 High-dimensional memory 

The memory stores the data, and the set of instructions necessary to run and manipulate a
program. A classical memory architecture is made up of memory locations, which are
an array of addressable registers, each of which holds a string of bits. The contents of a
memory location are made available for use by probing the memory using the location’s
address, which is also a string of bits. In our case, a memory that would store 20,000-bit
vectors would have to be addressed by 20,000-bit vectors, and would have a total of
$2^{20,000}$ memory locations in order to store the vector space as a whole.

This overabundance of memory locations would make it possible to retrieve a particular set
of data using its address, or even an approximation of its address through noise and error.
There are two storage modes, each with distinct properties and advantages.

3.3.1 Heteroassociative and Autoassociative memory 

Heteroassociative memory is based on the mechanism of storing a memory $\mu$ using the
pattern of $\alpha$ as an address. $\mu$ can thereafter be retrieved using either $\alpha$, or
a noisy approximation of $\alpha$, such as $\alpha'$, $\alpha''$, etc. This type of memory makes
it possible to store sequences within memories and their addresses. By making $\alpha$ the
address for the memory $\mu$, and $\mu$ the address for the memory $M$, we can sequentially
recall memories $\mu$ and $M$ using only $\alpha$ as an initial address.
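The chained recall described above can be sketched with a toy memory that returns the content stored under the nearest address. Note that this is a plain nearest-address lookup used for illustration, not a correlation matrix memory or any specific published model. The class name `HeteroMemory`, the dimensionality and the noise level are all assumptions.

```python
import random

DIM = 1_000  # assumed dimensionality for this toy example

def hamming(x, y):
    return bin(x ^ y).count("1")

def flip_bits(vector, n_flips, rng, dim=DIM):
    """Return a noisy copy of vector with n_flips distinct bits inverted."""
    for position in rng.sample(range(dim), n_flips):
        vector ^= 1 << position
    return vector

class HeteroMemory:
    """Stores (address, content) pairs; recall returns the content of the nearest address."""

    def __init__(self):
        self.pairs = []

    def store(self, address, content):
        self.pairs.append((address, content))

    def recall(self, address):
        _, content = min(self.pairs, key=lambda pair: hamming(address, pair[0]))
        return content

rng = random.Random(2)
alpha, mu, m = (rng.getrandbits(DIM) for _ in range(3))

memory = HeteroMemory()
memory.store(alpha, mu)  # alpha is the address of mu
memory.store(mu, m)      # mu is in turn the address of M

noisy_alpha = flip_bits(alpha, 100, rng)  # 10% noise on the initial address
first = memory.recall(noisy_alpha)        # retrieves mu
second = memory.recall(first)             # retrieves M
print(first == mu, second == m)
```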

First introduced as Correlation Matrix Memories[39], this type of architecture was not
able to correctly perform a recall task using noisy addresses. Heteroassociative memories
saw a big leap with Bidirectional Associative Memories[40]. In this model, subtle noise
in memory addresses could be overcome, and memory recall worked with forward and
backward passes, by storing the memory in vector pairs.

The other type of memory storage is autoassociative memory. Autoassociative memory is
achieved by storing a memory $\mu$ using $\mu$ itself as the address. Just as previously
described, $\mu$ can be recalled by $\mu$, or by a noisy version of $\mu$, such as $\mu'$,
$\mu''$, etc. This type of addressing therefore works as a noise filtering technique. It is
used in several content-addressable memory models, such as the Hopfield network[32], which is
guaranteed to converge to a stable pattern, ideally the pure, non-noisy memory, through
several iterations.

Going from $\mu'''$, to a less noisy $\mu''$, to $\mu'$, we can then finally obtain the stored
memory $\mu$. However, depending on the address used, its proximity to the initial value, and
how noisy a particular vector is, convergence to a false pattern could still occur.
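The following is a minimal Hopfield-style sketch of autoassociative recall, written in plain Python. It assumes bipolar (+1/-1) units, Hebbian outer-product weights with a zero diagonal, and synchronous updates. The two stored patterns are hand-picked to be orthogonal so that the outcome is deterministic. None of these choices is taken from the reviewed papers.

```python
def train(patterns):
    """Hebbian weights: sum of outer products of the stored patterns, zero diagonal."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j]
    return weights

def step(weights, state):
    """One synchronous update: each unit takes the sign of its input field."""
    new_state = []
    for i, row in enumerate(weights):
        field = sum(w * s for w, s in zip(row, state))
        new_state.append(1 if field > 0 else -1 if field < 0 else state[i])
    return new_state

def recall(weights, state, max_steps=10):
    """Iterate until the state stops changing (or max_steps is reached)."""
    for _ in range(max_steps):
        next_state = step(weights, state)
        if next_state == state:
            break
        state = next_state
    return state

p1 = [1, -1] * 8         # 16-unit pattern
p2 = [1, 1, -1, -1] * 4  # second pattern, orthogonal to p1
weights = train([p1, p2])

noisy = list(p1)
noisy[0], noisy[5] = -noisy[0], -noisy[5]  # corrupt two units
print(recall(weights, noisy) == p1)        # True: the noise is filtered out

inverse = [-s for s in p1]
print(recall(weights, inverse) == inverse)  # True: the inverted pattern is also stable
```

The second check shows one way a false pattern can arise: the inverse of a stored memory is itself a stable state, so a sufficiently corrupted address may settle there instead of on the intended memory.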





4 The brain and diversity 


There are two broad types of neurons present in the brain, each of which has its own
representations. The first type of neuron performs extremely simple and specific tasks, such
as color recognition or spatial orientation. These neurons behave as bits, or low-dimensional
vectors. Their firing is straightforward and predictable using what we know today as the basis
for their computation algorithms.

The other type of neurons is mixed selectivity neurons. These neurons are most notably 
present in the thinking, planning and higher order regions of the brain. Unlike classical 
selectivity neurons, these don’t respond exclusively to one stimulus or task. Instead, they 
react in different ways when confronted with a wide variety of stimuli[3]. They are essential 
for complex cognitive tasks and give us a computational boost and cognitive flexibility.
Up until very recently, it was thought that these neurons fired in random, stochastic waves. 
Their behavior did not conform to what we knew about neurons. 


4.1 Mixed selectivity 

Mixed selectivity neurons are different from classical selectivity neurons. They are nonlinear, 
highly heterogeneous, seemingly disordered, and difficult to interpret. However, various
teams are trying to reverse engineer their structures, by collecting data through various
neurophysiological experiments. In order to observe these features, single neuron recordings
must be performed in situ. Because of the invasive nature of the procedure, experiments
have been restricted to behaving animals, particularly to the prefrontal cortical region,
a region known for its involvement in thought, behavior and decision making.

The results seem to indicate that the populations of mixed selectivity neurons encode 
distributed information. This feature was not observed in the classical neurons, which are 
confined to single tasks. 

The neural representations are also high-dimensional, where each dimension corresponds to
the firing rate of one neuron. This makes it possible for these neurons to have a large set of
input-to-output relations. Through these features, neurons can generate a rich set of dynamics
and task solving capabilities [66]. The nonlinear mixed selectivity is sufficiently diverse 
across neurons that the information about task types could be extracted from the covariance 
between neural responses. Even when the classical individual neurons were experimentally 
devoid of any useful information, the knowledge and task solving capabilities of the primate 
were conserved. In order to remove classical selectivity, noise was added to every classical 
neuron, equalizing their average responses, and nullifying their effects. After removing this 
classical selectivity, a large drop in accuracy was observed in the early stages of the trial. 
However the accuracy increased later on as the trial progressed, and more information was 
gathered [67]. 





It is interesting to emphasize that these same properties were observable using
high-dimensional representations, simulated on a classical Von Neumann machine using binary
vectors. It’s essential to get a close look into the brain’s circuitry and representations
through neuroscience. But it is equally important to then learn to abstract away from the
neurotransmitters, membrane potentials, spike trains, and other physical attributes of the
brain, and instead tackle the resulting behaviors. This will help us understand the underlying
mathematical principles based on them. In other words, the logical design must be separated
from neural realization.

4.2 Robustness 

It is widely known that humans lose gray matter as they age. In fact, it’s been observed that
there’s a physiological 10% to 30% reduction of the cortical gray matter density, in normal
individuals, between ages 7 and 60[78]. While some individuals may lose as much as one third
of their brain, this goes on with little to no perceivable change. With all these variations,
and the significant day-to-day loss of neurons, our everyday life goes by uninterrupted. The
neural architecture is therefore very tolerant to component failure. This is unlike classical
binary representations in computers, where even a single bit of difference could either result
in system errors, or lead to significantly different results.

Redundant representations are a hallmark of high-dimensional vectors. The higher the
dimensionality, the higher the proportion of allowable errors will be. As discussed earlier,
patterns can differ by a certain number of units, yet they could still be considered equivalent
if the representational space was large enough.

For maximal robustness, and an efficient use of redundancy, the information should be
distributed equally among all units. This would lead to a steady degradation of the stored
information. This degradation is proportional to the number of failing bits, irrespective of
their types or positions. While we would like to employ a robust architecture, it is also
essential to take into account the plasticity that the brain exhibits. Modeling is therefore
done in a holistic manner, which trades off robustness against plasticity,
optimizing neural resources and redundancy[85].

4.3 Randomness 

Brains are highly structured organs, but many details are left to randomness. Network
formation is not very constrained, with a strong component of randomness reigning over
it ever since its genesis. The distances between nodes, and their connections, follow an
independent random distribution, making no two brains structurally identical[71]. Any two
brains are therefore incompatible at the level of hardware and internal patterns.

Not only is there a structural randomness, but also constant spontaneous neuronal
quantal release. This is due to the random probability associated with transmitter release[91].
Random potentials are therefore stochastically created, and are known as synaptic noise.
Synaptic noise can aid, or impair, various features such as signal detection and neural
performance, and may shape certain patterns.

For a correct neural simulation of the brain, the system should have randomness ingrained in
it. The model should be built using stochastic patterns and connections, by drawing vectors
in a pseudo-random fashion. The actual patterns themselves aren’t of any significance. What
does matter for computation and consciousness are the relations between patterns within
the system. Patterns for two similar concepts should be similar within one system, whereas
two similar, or identical, concepts should theoretically have different patterns in different
systems.


5 Computationalism, and Connectionism 


Research on intelligent systems has split into two different paradigms: the classical symbolic
artificial intelligence paradigm, also known as computationalism, and the connectionist AI
paradigm.

5.1 Computationalism 

Symbolic AI is at the forefront of computationalism. It maintains that the correct level at
which we should model intelligent systems, such as behaving animals, or the human mind,
is through symbols. A symbol is an entity that can have arbitrary designations, such as
objects, events, and relations between objects, events, or both. Each symbol corresponds to
one, and only one, entity at any given time[55]. Symbols may be combined together to form
symbol structures, such as expressions or sequences. In a conventional computer, a symbol
is represented by a variable, which can be instantiated, by making it refer to a specific entity,
or uninstantiated, where it doesn’t refer to any specific entity, and can therefore be rebound.

5.2 Connectionism 

Connectionism is split into two parts: localist representations, and distributed representations.
Each of them has its respective advantages and shortcomings. We will be detailing them
in the upcoming sections.

5.2.1 Localist representations

In localist representations, a single unit, or neuron, represents a single concept. It is relatively
close to the idea of symbolic AI, except instead of binding objects to units, we are binding
features of objects to units. This gives much more freedom in representing entities with
similar properties, but comes at the cost of harder variable binding, and a longer processing
time.

Variable binding can be achieved in several ways, most notably using signature
propagation[88]. A separate node is allocated for each variable associated with each entity,
and a value is allocated to represent a particular object; it is the signature of the object it
expresses. The signature can then be propagated from one node to the others in feed-forward,
or backward chaining processes depending on the task at hand. Signatures will also need
to be manipulated during state transitions, certain actions or orders, and variable rebinding.
This can be done through vector addition, or multiplication.

Another way to bind variables in localist representations is through phase
synchronization, which utilizes the temporal aspects of activations. An activation cycle is
divided into multiple phases. Each node fires at a particular phase, and can be either an object
node, which holds a certain value, or an argument node, which holds a variable. Each phase
represents a different object involved in reasoning. Objects have their own assigned phases,
and fire in a consistent manner. When an argument node fires at the same phase as an
object node, the variable is bound to its value. Dynamic binding is thereby accomplished,
which forces the variable node to fire synchronously, in phase with the constant node it is
bound to[43]. Reasoning is therefore a propagation of temporal patterns of activation.
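Binding by phase can be sketched in a highly simplified, discrete form. Each node is assigned an integer phase within the activation cycle, and an argument is considered bound to the object that fires in the same phase. The entity names, the rule format and the helper names below are hypothetical and chosen only for illustration. The sketch ignores the actual temporal dynamics of the cited model.

```python
def bind_by_phase(object_phases, argument_phases):
    """Bind each argument node to the object node that fires in the same phase."""
    phase_to_object = {phase: obj for obj, phase in object_phases.items()}
    return {
        argument: phase_to_object[phase]
        for argument, phase in argument_phases.items()
        if phase in phase_to_object
    }

def propagate(argument_phases, rule):
    """Apply a rule: each target argument inherits the phase of its source argument."""
    return {target: argument_phases[source] for source, target in rule.items()}

# Each object fires in its own phase of the activation cycle.
object_phases = {"John": 0, "Mary": 1, "book": 2}

# Arguments of a hypothetical predicate give(giver, recipient, given_object).
give_phases = {"giver": 0, "recipient": 1, "given_object": 2}
bindings = bind_by_phase(object_phases, give_phases)
print(bindings)  # {'giver': 'John', 'recipient': 'Mary', 'given_object': 'book'}

# Hypothetical rule: give(x, y, z) implies own(y, z).
rule = {"recipient": "owner", "given_object": "owned_object"}
own_phases = propagate(give_phases, rule)
derived = bind_by_phase(object_phases, own_phases)
print(derived)   # {'owner': 'Mary', 'owned_object': 'book'}
```

The inference step never copies an object. It only passes phases along, which is the sense in which reasoning becomes a propagation of temporal patterns of activation.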


5.2.2 Distributed representations 

In distributed representations, each entity is represented by a pattern of activity that is
distributed over many units. Each unit may also be used to represent many different
entities[75].

One way to implement distributed representations is through the use of microfeatures.
Microfeatures[26] are a more detailed approach to the items of the domain. Unlike the
first-order features used in localist representations, these are the composing elements of the
features described above. Different entities can be represented through a distinctive pattern
over these elements, making it possible for two different features, or two different objects, to
share some microfeatures.

Another way to realize distributed representations, one that doesn’t involve the use of
semantic microfeatures, is coarse coding[69]. Coarse coding imposes two conditions on the
features that are to be assigned to units. The features must be coarse, that is, they must be
general enough to be shared by more than one entity, and have an extended definition. The
assignments of the units must overlap and superpose, in such a way that one unit is involved
in representing several different objects. This redundancy leads to a highly fault-tolerant
network. The coarse code is a robust representation; the failure of one, or even several
components will not make the system unusable. These redundancies also help the system
provide a higher than normal degree of accuracy. Similarities between objects will therefore
be reflected by similarities among their representations.
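A one-dimensional toy example shows both conditions of coarse coding at work. Each unit responds to a wide interval (a coarse feature), neighboring intervals overlap, and a value is represented by the set of units it activates. The layout of the units (20 units of width 30 over the range 0 to 100) is an assumption made for illustration.

```python
def make_units(n_units, width, span):
    """Evenly spaced, overlapping receptive fields [low, high) over the range [0, span)."""
    spacing = span / n_units
    return [(i * spacing, i * spacing + width) for i in range(n_units)]

def encode(value, units):
    """Represent value by the set of units whose receptive field contains it."""
    return frozenset(i for i, (low, high) in enumerate(units) if low <= value < high)

units = make_units(n_units=20, width=30.0, span=100.0)

near_a = encode(40.0, units)
near_b = encode(47.0, units)
far = encode(90.0, units)

print(sorted(near_a))        # [3, 4, 5, 6, 7, 8]
print(len(near_a & near_b))  # 5 shared units: similar values, similar codes
print(len(near_a & far))     # 0 shared units: dissimilar values, disjoint codes

# Fault tolerance: losing one unit still leaves most of the code intact.
damaged = near_a - {5}
print(len(damaged & near_a))  # 5 of the 6 original units remain
```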





As is obvious, systems here work at sub-conceptual levels, and entities don’t have a
precise, complete, or formal description. Instead, features are represented by a pattern of
activity over a number of different units, scattered in the network in a non-organized fashion.
Because of this, the system has a complex internal structure that is crucial for its correct
operation. This may be the optimal way for the system to go through processing,
however it will be very hard for a human operator to decipher and interpret these structures.
As a result, variable binding has proved to be exceptionally difficult. A lot of attempts have
been made, of which we will cite a few now, while other more successful attempts will be
explained in more detail later in this review.

The Distributed Connectionist Production System[79], DCPS for short, manipulates symbolic
expressions using conditional rules. Rules act as rails, and specify which expressions
should be moved in, or out, of the working memory of the DCPS. The distributed representation
is done by means of coarse coding. However, the main limiting factor is the presence
of only one variable in the whole system. Variable binding is therefore very limited, and
can only be performed on a special set of terms. By using a memory that is large enough,
we may avoid immediate limitations by temporarily storing our variables there, at the cost
of computational time. However, this is a non-optimal solution that does not cover all
use cases. It trades computational time for simplicity, and the lack of variables remains
a severe bottleneck of the system.

Tensor product representations [77] have also been used to implement variable binding.
Vectors are combined using vector addition and aperiodic convolutions. This creates a
more complex object out of two or more basic features. Each object is therefore made up
of a set of variables, which are the arguments or roles, and each role has a value, or filler.
A vector can represent either a variable, or a value. Variable binding is accomplished by
forming the tensor product using aperiodic convolution. In order to represent a whole
entity or item, tensors can be added together. Pattern matching can be carried out using
these tensor products, but full variable to value unification cannot be achieved through this
method. However, this concept is the origin of much of the modern view on distributed
representations, as we will be seeing later.
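Role-filler binding with tensor products can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: binding is computed here as a plain outer product, the textbook form of the tensor product, rather than through the aperiodic convolution mentioned above; the role and filler names, the dimension, and the random vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256

def unit_vector(rng, dim):
    """Random vector of length 1 (random high-dimensional vectors are
    nearly orthogonal to each other)."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Hypothetical roles (variables) and fillers (values).
role_agent, role_object = unit_vector(rng, dim), unit_vector(rng, dim)
filler_john, filler_book = unit_vector(rng, dim), unit_vector(rng, dim)

def bind(role, filler):
    """Bind a value to a variable: outer (tensor) product, a dim x dim array."""
    return np.outer(role, filler)

def unbind(structure, role):
    """Query the structure with a role to get an approximation of its filler."""
    return role @ structure

# A whole item is the sum of its role-filler bindings.
structure = bind(role_agent, filler_john) + bind(role_object, filler_book)

recovered = unbind(structure, role_agent)
sim_john = float(recovered @ filler_john)  # close to 1
sim_book = float(recovered @ filler_book)  # close to 0
print(round(sim_john, 3), round(sim_book, 3))
```

The recovered vector is only approximately equal to the stored filler, because random roles are nearly, not exactly, orthogonal. The size of the bound object also grows with the square of the vector dimension, which is one motivation for the compressed representations discussed later in this review.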

In order to achieve an efficient level of binding variables to their values in connectionist
representations, rapid variable binding is essential. This implies creating variables on the
fly, and binding them to their correct values in a virtually instantaneous manner, before
deploying them, and using them according to their context. However, that comes with a
wide range of problems of its own [21].


5.3 The necessity of variable binding 

As discussed, in order to meet the requirements of synaptic plasticity and speed of the
brain, rapid variable creation and binding should be performed in distributed representations.





Numerous attempts have been made to solve this problem, each
with its own trade-offs.

Eliminative connectionism was introduced in order to sidestep the problem of rapid
variable creation, without tackling it head on. The objective of this approach was to
completely avoid using localized or explicit variables. However, in order to match the
brain’s capacity to induce and represent identity relations, as well as concatenation functions,
variables are needed for the manipulation of data. Explicit variables are also needed to
achieve a decent level of generalization, and functional mapping [44]. The brain has a
remarkable ability when it comes to relating variables that are novel
in our experience to variables that are familiar to us. Not only that, but humans are able to
induce a general pattern from a novel test case remarkably fast. Let us demonstrate
this through an example:

Consider the following set of inputs and outputs: 

Input: A car is a // Output: Car 

Input: A house is a // Output: House 

Input: A capacitor is a // Output: Capacitor 

What would the Output be for: 

Input: A juitrekfluit is a // 

The solution to this test case is apparent. In less than a second, a human subject can figure
out the answer to this novel case, for a word which they have never seen before. One may argue
that the semantic meaning associated with these words might bias the human operator, making
it unfair to compare them with a machine that doesn’t have any capacity for understanding
phrases or words. We will therefore present another more involved, and random, example:

Consider the following set of inputs and outputs: 

Input: jutrika setek bolur hiuna // Output: Bolur 

Input: jutrika setek nyata hiuna // Output: Nyata 

Input: jutrika setek jinski hiuna // Output: Jinski

What would the Output be for: 

Input: jutrika setek kouyte hiuna // 

This generalization capacity, induced in hundredths of a second when faced with even
completely novel structures, does not arise from training a network the moment the sample
data was presented. There is simply not enough time to train a neural network this quickly,
especially at the level of complexity exhibited by the brain. If synaptic weights were indeed
being changed, this could only be done through Long Term Potentiation, or Short Term
Potentiation. Long Term Potentiation requires dendritic spine growth, and therefore takes hours
to develop. Short Term Potentiation is a transitory alteration of the synaptic vicinity, and
takes around 10 seconds to develop. If what was observed here was not due to new changes
in synaptic weight, yet was novel enough to surely not have a neural network specifically







trained to suit it, then it very probably relies on some pre-existing skill, acquired through the 
training of subnetworks within the brain at a past date. These networks are being used with 
their pre-existing variables, to which novel values are being bound rapidly, on the fly. 

A practical solution to this problem may reside in VSAs, or Vector Symbolic Architectures,
which permit effective variable binding, and can lead to generalization in the presence
of suitable algorithms[54]. This group of architectures assumes that each value, and each
variable is represented by a distributed activation pattern within a vector. An additional
vector can be used to encode the result of binding a value to its variable. However, even
in such cases, it is impossible to predict how a network would assign and operate through
novel tasks on its own, as weight vectors gained through regular training may not suffice for
generalization to novel cases[21]. We will be detailing VSAs later in this review.


6 Artificial Neural Networks 


Artificial neural networks, ANN for short, are an adaptive computational method that learns
a particular task through a series of examples. They can be used to model various cases
of nonlinear data. As it happens, most physical phenomena are inherently
nonlinear. This makes artificial neural networks an excellent candidate to model real world
occurrences. The test samples are multivariate data represented by several features and
components, that is to say, they are represented by high-dimensional vectors[83].

ANNs were once deemed a curse brought onto computer science. They produced
astonishingly accurate results, but using high-dimensional vectors came at the cost of an
exponentially growing number of function evaluations. This occurrence was due to the
lack of optimization of older algorithms, which used exhaustive enumeration strategies[6]
that were sub-optimal for such tasks. Currently, we are better equipped to deal with
high-dimensional neural networks through dynamic programming[33], conditional distribution of
features among hidden layers [8], and generally better performing and more optimized neural
network structures such as recurrent neural networks and restricted Boltzmann machines[1].

Old ANN algorithms were developed mostly for low-dimensional computing. Nowadays,
with the era of massive automatic data collection systematically helping both researchers
and programmers obtain a large number of measurements, new interest has
been kindled in researching high-dimensional modeling[13]. This motivation was also
supported by parallel discoveries in neuroscience, which helped generate computer models
based on the efficient computational abilities of our brains. The first of such models were
PDPs.

6.1 Parallel Distributed Representations 

Parallel distributed processing [46], PDP for short, is a type of artificial neural network that 
took its inspiration from the processing model of the brain. In situ, neural connections are 





distributed in parallel arrays, in addition to serial pathways. Functions are not computed
sequentially in series, but are rather processed concurrently, in parallel with each other.
More than a single neuron is sending signals at a time, and up until the mid-1980s, this was
a rarely used feature in the digital computer.

As was discussed earlier, in the Von Neumann architecture, each processor is able to
carry out only one instruction at a time. In PDPs however, information is represented in a
distributed manner over different units, and more than one process can occur at any given
time. Memory is not stored explicitly, but rather is expressed in the connections between
units, similar to how neural synapses work. Learning can occur through gradual changes in
connection strengths through experience. This system also exhibits all the advantages of
distributed representations.


6.2 Restricted Boltzmann Machines 

Restricted Boltzmann machines, RBM for short, are a type of artificial neural network
that specializes in deep learning[76]. An RBM has only 2 node layers: a visible layer, that
receives input, and a hidden layer, which allows the occurrence of probabilistic states.

An RBM uses a generative probabilistic distribution over a set of inputs, in order to
extract relevant features. Several RBMs can be stacked on top of each other, in order
to gradually aggregate groups of features. Each RBM within the stack deals with features
that are progressively more complex. Through these numerous layers, several groups of
features are conceived, which help decode the presented concept into smaller models of
representations. The more layers there are in the stack, the better it handles complex tasks.
This, however, comes at the cost of computational time. A network composed of stacked RBMs
is called a deep-belief network, and is considered to be one of the more prominent
breakthroughs in artificial intelligence research and deep learning.

Each layer communicates with both the previous and the subsequent layer. Nodes
do not communicate laterally with other nodes of the same layer. While each hidden layer
serves as an effective hidden layer for the higher nodes, it also acts as a visible, input layer
for the lower nodes. An RBM is fed one test trial at a time, without any prior knowledge of
its nature. Learning can be supervised, where the system infers functions and features from
labeled data, or unsupervised, where the network attempts to find a structure on its own,
among unlabeled data. The connection weights between two nodes that lead to a correct
conclusion become stronger, while the weights that lead to wrong answers become weaker.

However, restricted Boltzmann machines are not deterministic. Given a particular set of
units, initial conditions, inputs, and connections, the network won’t always arrive at the
same conclusion, due to its stochastic nature. The network operates through probabilistic
distributions over patterns, by doing several passes over the original test set. States are
updated accordingly, with a high probability of the network converging, and a low probability
of it diverging[70]. This is unlike deterministic networks, such as Hopfield Networks,





where the network is programmed to always, and consistently, attempt to converge through
each step. What this leads to is a higher operation time, in exchange for a lower chance of
getting stuck at local minima. That is, there is a higher chance to find the "absolute truth" in
stochastic models, because their randomness is enough to, almost, force the network to settle
at the lowest state possible.
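The stochastic behaviour described above can be made concrete with a minimal Python sketch of a single, untrained RBM. The class name `TinyRBM`, the layer sizes, and the random initial weights are assumptions made for illustration and are not taken from [70] or [76]; no learning rule is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    """Hypothetical, untrained RBM: one visible and one hidden layer,
    with weights only between the layers (no lateral connections)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.weights = 0.1 * rng.standard_normal((n_visible, n_hidden))
        self.visible_bias = np.zeros(n_visible)
        self.hidden_bias = np.zeros(n_hidden)

    def sample_hidden(self, visible):
        # Each hidden unit switches on with a probability, not with certainty.
        prob = sigmoid(visible @ self.weights + self.hidden_bias)
        return (self.rng.random(prob.shape) < prob).astype(float)

    def sample_visible(self, hidden):
        prob = sigmoid(hidden @ self.weights.T + self.visible_bias)
        return (self.rng.random(prob.shape) < prob).astype(float)

    def gibbs_step(self, visible):
        """One pass visible -> hidden -> visible."""
        hidden = self.sample_hidden(visible)
        return self.sample_visible(hidden), hidden

rng = np.random.default_rng(0)
rbm = TinyRBM(n_visible=6, n_hidden=8, rng=rng)
pattern = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# The same input is presented 20 times; the hidden states need not repeat.
hidden_samples = [tuple(rbm.gibbs_step(pattern)[1]) for _ in range(20)]
print(len(set(hidden_samples)), "distinct hidden states out of 20 passes")
```

Presenting the same pattern repeatedly yields different hidden states, which is the sense in which the network is not deterministic.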

The most widely adopted algorithms for use in restricted Boltzmann machines are
Markov chains[12], which are logical circuits that connect two or more states using guided
and learned probabilities. The probabilities exiting any one state add up to 1. The process is
sequential: given one state, it can work out what the next state is likely to be, and predict
subsequent states and sequences.
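A Markov chain of this kind can be written down as a table of transition probabilities. The following Python sketch is a minimal illustration; the three states and their probabilities are hypothetical, and the sketch shows the chain on its own rather than its use inside an RBM.

```python
import numpy as np

states = ["A", "B", "C"]
# Hypothetical transition probabilities; row i lists the probabilities
# of leaving state i, so each row sums to 1.
transition = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.5, 0.0, 0.5],
])

def next_distribution(distribution, transition):
    """Probability of being in each state after one more step."""
    return distribution @ transition

def sample_path(rng, start, transition, length):
    """Draw one sequence of state indices, starting from `start`."""
    path = [start]
    for _ in range(length - 1):
        path.append(int(rng.choice(len(transition), p=transition[path[-1]])))
    return path

start_in_a = np.array([1.0, 0.0, 0.0])
after_one = next_distribution(start_in_a, transition)
after_two = next_distribution(after_one, transition)

rng = np.random.default_rng(0)
path = [states[i] for i in sample_path(rng, 0, transition, 10)]
print(after_one, after_two, path)
```

Only the current state is needed to compute the distribution over the next one, which is what makes the prediction of subsequent states sequential.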

6.3 Recurrent Neural Networks 

6.3.1 General Overview 

Recurrent neural networks, RNN for short, are a powerful class of artificial neural networks
where connections between units form a directed cycle. The initial inspiration for this model
came from neurophysiological observations. Neurons were not only feedforward, they
were also interconnected. Signals between neurons were carried according to the all-or-none
principle. Thus was born the concept of RNNs. Nodes are not only linked using forward chaining
connections, there are also lateral connections, as well as connections from current layers
into past layers. This binary stream[47] had a remarkable short-term memory for its time, and
had influences that stretched far beyond neural networks. It even influenced the development
of the digital computer by Von Neumann.

The basic architecture of recurrent neural networks is the fully recurrent neural 
network. Each node has directed connections to every other node. There are many models 
and algorithms that seek to better the computation and learning capabilities of RNNs. Apart 
from their use in the computational and mathematical fields, RNNs have experienced a large 
adoption rate in brain modeling. They are not only a way for researchers to reproduce the 
mind’s capabilities digitally, but also a way for them to theorize features which have not been 
experimentally observed yet through the traditional means of neuroscience or psychology. 
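The update of a fully recurrent network can be sketched as follows. This is a minimal Python illustration under stated assumptions: the tanh activation, the number of units, and the random, untrained weights are choices made here and are not prescribed by the text.

```python
import numpy as np

def rnn_step(state, inputs, recurrent_weights, input_weights):
    """One update of a fully recurrent layer: every unit receives the
    previous activation of every unit, plus the external input."""
    return np.tanh(recurrent_weights @ state + input_weights @ inputs)

rng = np.random.default_rng(0)
n_units, n_inputs = 5, 3
# Hypothetical random weights; a trained network would learn these.
recurrent_weights = rng.standard_normal((n_units, n_units)) / np.sqrt(n_units)
input_weights = rng.standard_normal((n_units, n_inputs))

state = np.zeros(n_units)
pulse = np.array([1.0, 0.0, 0.0])
silence = np.zeros(n_inputs)

# A single input pulse, followed by silence: the state keeps a trace of it.
trace = []
for inputs in [pulse, silence, silence, silence]:
    state = rnn_step(state, inputs, recurrent_weights, input_weights)
    trace.append(state.copy())
print(np.round(trace[-1], 3))
```

After the pulse, the input is silent, yet the activations remain non-zero: the recurrent connections carry a short-term trace of the earlier input.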

6.3.2 Turing Completeness 

RNNs’ inherent representational power made them an attractive candidate for universality, 
otherwise known as Turing Completeness. Turing Completeness can be achieved by proving 
that the system at hand can compute each and every function computable by a Turing 
machine. In other words, this means the system should be able to calculate any recursive 
function, use any recursive language, and solve any problem solvable using any effective 
method of computation or algorithm[81][82]. 

At first, computational universality was achieved using recurrent neural networks 





with a finite number of neurons, and high order connections, by combining their activations 
through multiplicative means[63]. However, this is an unattractive candidate for practical 
use because of the leaps of data that must be calculated. Such a model would consume a 
large amount of memory space and its convergence time would grow exponentially with 
the complexity of a given problem. Such a model is sub-optimal on many different levels. 
Computational universality was also achieved in linear RNNs using additive connections, 
assuming an unbounded, and infinite number of neurons [90]. This is also not practical, as 
having an infinite set of nodes is physically not possible. 

In order to assume Turing completeness, one can simply prove that the system can simulate 
a Turing machine. In the case of finite, low-order, recurrent neural networks, this can be 
done by proving that there exists a network, started from an inactive initial state, that can 
lead to the correct output from an input sequence, using a recursively computable function. 
The input could also be removed entirely, to be then encoded into the initial state of the 
network. This should lead to a valid output, and correct results. Computational universality 
was therefore proved to stand true in low order connection, finite neural networks [74]. An 
estimated total of 886 neurons (870 processor nodes, and 16 input and output nodes) is 
needed to reach this universality. 


6.4 Using an external stack memory 

A recurrent neural network with no hidden layers is capable of learning state machines, and 
is at least as powerful as any multilayer perceptron[19]. It contains an inherent internal 
memory that’s relatively small. This internal memory, stored in weights between the nodes 
is not enough to store all the details that would be necessary for intermediate or long-term 
use. In order to boost memory storage and long term performance, one could theoretically 
use an internal memory stack. However, this would require a large amount of resources, 
and would come in the form of additional nodes and connections, which would in turn 
exponentially increase computational time [64]. An optimal memory stack would therefore 
be external. It would allow for easy modulation, as well as continuous communication with 
the network, which permits the use of classical and continuous optimization methods, such 
as gradient descent. 

An example of this is the Neural Network push-down automaton (NNPDA, [11]). 
This architecture consists of a set of fully recurrent neurons representing the network’s state, 
and permitting classification and training. State neurons receive their knowledge from three 
distinct sources. First off, they receive their information from input neurons, which register 
input fed to the network by sources external to the system. Secondly, from read neurons, 
that interact with the external memory stack, and keep track of the symbols that are on 
top of the stack. And thirdly, they receive feedback from their own recurrent connections, 
seeing as this is a recurrent neural network. The network is connected to the stack through a 
non-linear error function, which approximates results that hold with high probability. 





Although it is possible to learn simple tasks using first-order networks, higher order nets
are a necessity for learning more complex features. As we’ve grown accustomed to when it
comes to making computational algorithms more complex, this comes at the expense of a longer
processing time, and slowed convergence. First-order networks are by no means complete:
they lack several features required to correctly order stored memory, as well as isolators, to
help fend off any unwanted interference. We will see how this problem could be overcome
using LSTMs, and NTMs.

6.5 Long Short-Term Memory 

As we stated before, recurrent networks can use their feedback connections in order to store
representations of recent input events in the form of activations. This is a very useful feature
when dealing with tasks requiring multiple inquiries over a short-term memory.

With conventional algorithms such as back-propagation through time[86], the error
signal is rescaled at every step it is propagated back, so its magnitude depends exponentially
on the number of steps. When the scaling factor is greater than one, the error blows up,
leading to oscillating weights, bifurcations and unstable learning[60]. When the factor is
smaller than one, the error vanishes instead, which induces excessive learning times when
bridging events through long time lags, rendering the modification of existing weights a time
consuming challenge to be dealt with[7]. The conventional backpropagation method is simply
too sensitive to recent distractions.
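The effect of repeated rescaling can be checked with simple arithmetic. The sketch below is a deliberate simplification: it assumes a single, constant scaling factor per time step, whereas in a real network the factor depends on the weights and on the slope of the activation functions.

```python
def backpropagated_error(initial_error, factor, steps):
    """Scale an error signal by the same factor at each step back in time."""
    error = initial_error
    for _ in range(steps):
        error *= factor
    return error

# Hypothetical factors, propagated back through 50 time steps.
grown = backpropagated_error(1.0, 1.5, 50)
shrunk = backpropagated_error(1.0, 0.5, 50)
held = backpropagated_error(1.0, 1.0, 50)
print(grown, shrunk, held)
```

A factor slightly above one explodes, a factor below one vanishes, and only a factor of exactly one leaves the error intact over long time lags.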

6.5.1 The basic architecture 

To avoid the problems that come with back-propagation, long short-term memory, LSTM
for short, was introduced[29]. It’s a second-order recurrent neural network that enforces
constant error flow, by propagating it through dedicated constant error carousels (CEC).
LSTM uses a linear constant activation unit to store the error, as opposed to the non-linear
sigmoid activation functions used for the other computing units.

Because the constant unit follows a different scaling path than the rest of the units
in the network, units connecting to it will be receiving conflicting weight updates. This
makes efficient learning difficult to achieve, and unstable to maintain. To counteract this
phenomenon, we add multiplicative input and output gates. The input gates protect the
memory content stored in the constant unit from outside perturbation; the error is therefore
only edited when it needs to be updated. Output gates protect the rest of the units in the
network from the irrelevant information stored inside the constant error carousel; the error
therefore only affects the network when it is recalled and needed. The gates are context
sensitive nodes that can be trained, using gradient descent, to decide when to overwrite or
access the values in the CEC.

A "memory cell" is built around the CEC, using a central linear unit, with a fixed
self-connection, and delimited by the input and output gates discussed earlier. Several





memory cells can share the same gates, and form a memory cell block, which is a more
efficient architecture that facilitates storage. Errors inside the memory cell blocks are not
scaled when propagated back further in time because of their linear constant activation
functions. This eliminates exploding or vanishing error values, and stabilizes long-term
memory.
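The role of the two gates can be illustrated with a scalar sketch of a single memory cell. It is a simplification under stated assumptions: the gate drives are supplied by hand rather than computed from learned weights, the tanh squashing functions are a choice made here, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(cell_state, candidate_input, in_gate_drive, out_gate_drive):
    """One step of a memory cell with input and output gates only."""
    in_gate = sigmoid(in_gate_drive)    # near 1: let new content in
    out_gate = sigmoid(out_gate_drive)  # near 1: expose content to the network
    # Linear unit with a fixed self-connection of weight 1.0: the stored
    # value is carried over unchanged unless the input gate opens.
    new_state = 1.0 * cell_state + in_gate * np.tanh(candidate_input)
    visible = out_gate * np.tanh(new_state)
    return new_state, visible

# Write once, with the input gate open and the output gate closed.
state, _ = memory_cell_step(0.0, 2.0, in_gate_drive=20.0, out_gate_drive=-20.0)
stored = state

# Hold for 100 steps: both gates closed, despite a strong distracting input.
for _ in range(100):
    state, hidden_out = memory_cell_step(state, 5.0, -20.0, -20.0)

# Recall: open the output gate only.
state, read_out = memory_cell_step(state, 5.0, -20.0, 20.0)
print(stored, state, hidden_out, read_out)
```

With the input gate closed, the stored value survives the distracting inputs almost untouched, and it only reaches the rest of the network once the output gate opens.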

This architecture can handle noise, distributed representations, and continuous values, while
increasing the ability to generalize over untrained values better than older implementations
of neural networks, including BPTT[86] and RTRL (real-time recurrent learning[68]).

6.5.2 Extensions to the basic LSTM model 

LSTM, being one of the more promising RNN architectures of its time, received several
improvements over the years. Forget gates[17] were added alongside the input and output
gates. Forget gates modulate the amount of activation a memory cell keeps from its previous 
time-step. They are able to force a quick memory dump in memory cells that have completed 
their tasks. This helps in rapidly keeping units effective, and augments the proportion of 
efficiently used and recruited nodes during later computations. 
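The effect of a forget gate on the stored value can be sketched in isolation. As before, this is a scalar illustration with hand-set gate drives, which are assumptions; in an actual network the drive is computed from learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_update(cell_state, gated_input, forget_gate_drive):
    """The forget gate scales how much of the previous state is kept."""
    forget_gate = sigmoid(forget_gate_drive)
    return forget_gate * cell_state + gated_input

kept = cell_update(0.9, 0.0, forget_gate_drive=20.0)     # gate open: memory kept
dumped = cell_update(0.9, 0.0, forget_gate_drive=-20.0)  # gate closed: memory dumped
print(kept, dumped)
```

A single step with the gate closed is enough to empty the cell, which frees it for later computations.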

Another improvement to LSTM which quickly became a staple in computation is peephole
connections. Peephole connections[18] are links between the three types of gates, and the
memory cells. In the basic LSTM model, when the output gates are closed, the memory
cell’s visible activity is null. The information contained inside the CEC is therefore hidden
from its associated gates, and surrounding nodes, making it impossible to efficiently control
the information flow into, and out of the memory cell block. These peephole connections
enable the cell’s unmodulated state to be visible to outside units. This improves the network’s
performance significantly by reducing random instances of gating and probabilistic editing,
which used to happen due to insufficient information.

In order to expand LSTM beyond second-order RNNs, Generalized LSTM
(LSTM-g, [51]) was introduced. In this architecture, all the units in the network have
an identical set of operating instructions. A unit relies only on its local environment to
determine what role it will be fulfilling. Units can be input gates, output gates, forget
gates, memory cells, or any combination of these. Every unit is also trained using the same
algorithms. That is in contrast with conventional LSTM, where each type of unit had its
specific set of algorithms. A number of other changes were also made in order to improve
flexibility. This helps improve performance, and broadens the applicability of LSTM
networks in machine learning. LSTM-g is compared to the regular LSTM
architecture in Figure 1.




(a) LSTM memory block 


(b) Equivalent LSTM-g architecture 



Figure 1: Comparison between an LSTM memory cell block, and its equivalent in LSTM-g. Black
lines: Weighted connections. Grey lines: Gated connections. In the LSTM model (a), the input is first
summed, then gated as a whole by the input gate. The output is also computed as a whole: all subsequent
units receive the same modulated output. Peephole connections project from an internal stage in the memory
cell to the controlling gate units; only the final value of the memory cell is visible to other units. In the
LSTM-g model (b), the inputs to the memory cell are gated individually before being summed. The output leaves the
memory cell unmodulated, and is then individually modulated by the output gate to other cells. Peephole
connections connect to the recurrent units of the memory cell, and are able to see its intermediate state. This
allows for greater flexibility. Image courtesy of [51].


6.6 Long Term memory 

Recurrent neural networks stand out from other machine learning methods due to their ability 
to learn and carry-out complicated transformations of data over extended periods of time. 
But one aspect that was largely neglected in computer science and research was the use of 





































logical flow control and external memory. As we’ve discussed, RNNs are extremely effective
when it comes to storing recent memories. With the advent of LSTM, intermediate-term 
memory storage became a lot more efficient. However, this did not significantly improve 
networks’ storage capacities. During long, and complicated tasks, networks should have a 
way to store large quantities of information, while efficiently maintaining control over their 
flow. 

6.6.1 Neural Turing Machines 

Neural Turing machines [20], NTM for short, are a neural network architecture that attempts
to deal with the problems showcased earlier. NTMs are enriched with a large addressable
memory, that can be trained using gradient descent, to yield a practical mechanism for
machine learning. For maximum effectiveness, the network will have to operate on
rapidly-created variables, which are data that are quickly bound to memory slots. Their architecture
is illustrated in Figure 2.

Neural Turing Machines interact with the external world using input and output
vectors, just like all neural networks. However, they also interact with a memory matrix,
using an attentional process to selectively perform read and write operations. Every component
of the architecture is differentiable, and instead of addressing a single chosen element per cycle, the
network could learn to interact with a variable number of elements in the external memory
bank, as it sees fit. This is done through blurry read and write operations, determined by an
attentional focus mechanism. This constrains them, and makes it possible to attend sharply
to the memory at a single location, or weakly to the memory at many locations.

Read vectors are stored in the memory matrix’s column, while the number of rows is 
equal to the total number of memory locations. Writing is done in two parts. First using an 
Erase Vector, a full memory location can be entirely, or partially emptied to make room for 
subsequent data. After that, the network uses an Add Vector to write to the memory. Both 
Erase and Add vectors are differentiable, computing their current state from their previous 
state, weight vectors and input orders. The presence of two separate vectors, both being 
independently subject to learning, allows a fine-grained control over which elements in each
memory location are modified and how. 
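The blurry read and the two-part write can be sketched with NumPy. This is a minimal illustration under stated assumptions: rows of the memory matrix are taken to be memory locations, the attention weights are set by hand rather than emitted by a trained controller, and the function names `read` and `write` are hypothetical.

```python
import numpy as np

def read(memory, weights):
    """Blurry read: a weighted sum of the memory locations (rows)."""
    return weights @ memory

def write(memory, weights, erase, add):
    """Two-part write: first erase, then add, both scaled by the attention
    weight that each location receives."""
    erased = memory * (1.0 - np.outer(weights, erase))
    return erased + np.outer(weights, add)

memory = np.zeros((4, 3))                   # 4 locations, 3 components each
sharp = np.array([0.0, 1.0, 0.0, 0.0])      # attends to location 1 only
blurry = np.array([0.5, 0.5, 0.0, 0.0])     # attends weakly to two locations

memory = write(memory, sharp, erase=np.ones(3), add=np.array([1.0, 2.0, 3.0]))
sharp_read = read(memory, sharp)
blurry_read = read(memory, blurry)

# The erase vector selects which components of a location are emptied.
partly_erased = write(memory, sharp, erase=np.array([1.0, 0.0, 0.0]), add=np.zeros(3))
print(sharp_read, blurry_read, partly_erased[1])
```

Because both operations are built from multiplications and additions only, they remain differentiable with respect to the weights and to the erase and add vectors.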

The weightings arise from the combination of content-based addressing and location-based 
addressing. Content-based addressing focuses attention based on the similarity between the 
calculated, noisy values, and the noiseless values present in the memory space. This makes 
it possible to use approximations to speed up calculations, leaving it to the autoassociative 
memory to retrieve the exact value and clean up the noise[32]. Location-based addressing is
used when values with arbitrary content are employed. Since it is not possible to predict 
these values, yet they still need a recognizable name and address, the values are addressed 
by location, and not by content. This is designed to facilitate both simple iterations across 
the locations, and random-access jumps[31]. The content of a memory location could also 
include information about the location of another value inside it. This allows the focus to 





[Figure 2 diagram. Recoverable labels: External Input, External Output, Read Heads, Write Heads, Memory.]


Figure 2: Neural Turing Machine architecture. The controller, or neural network, receives the input from 
an external environment, and emits an output in response. The controller is also able to read, and write from 
a memory matrix, using read, and write heads. Image courtesy of [20]. 


jump, for example, to a location next to an address accessed by content.
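Content-based addressing, followed by a jump to a neighbouring location, can be sketched as follows. The cosine similarity, the softmax normalization, and the `sharpness` parameter are assumptions made for this illustration, and the hard one-step shift with `np.roll` is a simplification of a learned location shift.

```python
import numpy as np

def content_weights(memory, key, sharpness):
    """Attention over locations from the similarity between a (possibly
    noisy) key and the content of each location."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = (memory @ key) / norms      # cosine similarity per location
    scores = sharpness * similarity
    scores = scores - scores.max()           # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # weights sum to 1

# Hypothetical memory with three noiseless stored values (one per row).
memory = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
noisy_key = np.array([0.9, 0.1, 0.0])

weights = content_weights(memory, noisy_key, sharpness=50.0)
retrieved = weights @ memory                 # cleaned-up version of the key

# Location-based step: move the focus to the next location.
shifted = np.roll(weights, 1)
print(np.round(weights, 3), np.round(retrieved, 3), int(np.argmax(shifted)))
```

The noisy key retrieves the noiseless stored value it most resembles, and the shifted weighting then focuses on the location right after it.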

Either a feedforward, or recurrent network can be used as the controller, each with its own
advantages. While the RNN allows the controller to mix information across several time-steps
of operation and permits a larger number of read and write heads, a feedforward
controller confers greater transparency of the network’s operation, because its patterns are
easier to interpret by a human operator. This transparency comes at the cost of an imposed
bottleneck due to the reduced number of read and write heads.

This network architecture leaves several free parameters for the operator to decide upon, 
such as the size of the memory matrix, the range of location shifts in the location-based 
addressing, the number of read and write heads, as well as the type of the controller network. 
All of these parameters serve to make the Neural Turing Machines [20] as adaptable as 
possible for a number of different tasks. 














6.6.2 Memory Networks 


Another solution to the lack of long-term memory in neural networks is Memory
Networks[87]. This architecture was conceived around the same time as the Neural Turing
Machines described earlier, by a different, and independent research team, indicating a
recent surge in interest in the use of external memories as an extension to neural networks.
The model operates in similar ways to the NTM, by combining machine learning inference
with an external memory component that can be read and written to. This is done through
four learned components: an input feature map, an output feature map, a response
element, that converts the output into the desired format, and a generalization component,
which updates old memories given the new input and the current state of the network.


7 Vector Symbolic Architectures 

Vector symbolic architectures, VSA for short, are high-dimensional vector representations
of objects, relations and sequences, used in machine learning. A good representation helps
systems compute orders of magnitude faster, making possible what would otherwise have
taken a near infinite amount of time. An example of a VSA that was briefly discussed
earlier is the tensor product model[77]. In an upcoming section, we will also be discussing
holographic reduced representations[62], one of the more notable and widely used VSA
architectures.

7.1 Vector components 

Components of vectors can be binary, sparse, continuous or complex. Vector operations
such as addition, binding and permutation can be used to group desired elements together.
Since we are looking for a representation that is specifically suited for optimizing machine
learning, a number of constraints will have to be employed. With the goal to represent
potentially hundreds of thousands of objects at a time, a distributed representation will
have to be used, otherwise storing each component in its independent vector would not be
sufficient.

Similar objects and structures will have to be mapped to similar vectors, having a
small Euclidean (or Hamming) distance separating them. This makes it so that similar
vectors will have a significantly greater dot product than two randomly chosen vectors. Such
a design decision would help when filtering out the error and noise generated during the
approximation and computation phases, ensuring that a word’s meaning, or a pixel’s color,
doesn’t vary too much across iterations and during recovery. This also gives the network a
better ability to generalize, by finding approximate matches to novel concepts, and using its
past knowledge to determine the implications of new entities.
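
The relation between distance and dot product can be checked numerically. The sketch below
assumes vectors with ±1 components (one of several possible component types) and a
dimension of 10,000; both choices, and all variable names, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # illustrative dimension

a = rng.choice([-1, 1], size=n)  # a random item vector
b = rng.choice([-1, 1], size=n)  # an unrelated random vector

# A "similar" vector: copy a and flip exactly 10% of its components.
flip = rng.choice(n, size=n // 10, replace=False)
a_similar = a.copy()
a_similar[flip] *= -1

hamming = int(np.sum(a != a_similar))  # number of differing components
dot_similar = int(a @ a_similar)       # equals n - 2 * hamming
dot_random = int(a @ b)                # zero-mean, spread of order sqrt(n)
```

For ±1 vectors the dot product equals n minus twice the Hamming distance, so a small
distance translates directly into a dot product close to n.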

Another important constraint to consider is the use of fixed length vectors. The goal
here is to incorporate VSAs with standard machine learning algorithms, such as Structured
Classification, where inputs and outputs are most conveniently expressed as vectors with a
pre-specified number of components. Sequences, such as sentences, however don’t have a
fixed number of components, or a fixed structure. In the specific example of sentences, a
potential solution to that problem would be using part-of-speech tagging, chunking, named
entity recognition and semantic role labeling[10]. This helps us convert sentences into
unambiguous data structures using the syntactic and semantic information contained in
them. The dimension of the vectors should be tuned for efficiency: larger vectors have
more storage capacity, while smaller vectors decrease the system’s computation time and
make for faster network convergence.

7.2 Vector operations 

All vectors being of the same size and dimension enables us to use the wide array of
pre-existing vector operations and algorithms, in order to combine and structure objects.
One such operation is the classical vector addition operator. Addition is commutative, and
encodes an unordered set of vectors into a single vector of the same dimension. A structure,
or sentence, could therefore be encoded in a single normalized vector, consisting of the sum
of the individual word vectors. A different sentence formed using the same word vectors
would also have the same resulting sum. Unlike scalar addition, vector addition preserves
constituent recognition, which enables information retrieval[9]. If the dot product of the
resulting sum vector with the constituent we are looking for is close to the dimension of
the vector, then that element is present in the vector. This criterion holds for vectors
whose components have magnitude 1, such as ±1 vectors, since the dot product of such a
vector with itself equals its dimension. In order to better demonstrate this operation, let
us consider a brief example:

Let us consider the set of n-dimensional vectors:

a, b, c, d, e, and F, where F = a + b + c + d + e.

We would like to check if the vector V is one of the vectors
composing F.

For the purpose of demonstration, let us consider:

$V = b$

$V \in F$

This is done by computing the dot product of the vectors
V and F:

$F \cdot V = (a + b + c + d + e) \cdot V$

The dot product is distributive over addition:

$F \cdot V = (a + c + d + e) \cdot V + b \cdot V$

Similar vectors have a dot product that is almost equal to
their dimension n.

$F \cdot V = \langle \text{mean-0 noise} \rangle + n$

If the dot product of the sum vector with the tested vector is close to, or greater than, the
vector’s dimension, this would indicate that the tested vector is one of the vectors that was
used for the sum. However, since addition is unordered, it is not possible to know where
the position of that vector was.
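
The membership test above can be sketched in a few lines. The sketch assumes ±1 component
vectors and uses half the dimension as the decision threshold; the threshold, the dimension
and the function name are illustrative choices, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000  # illustrative dimension

# Six random ±1 vectors: five constituents and one vector left out of the sum.
a, b, c, d, e, outsider = rng.choice([-1, 1], size=(6, n))

F = a + b + c + d + e  # unordered superposition of the constituents


def is_constituent(total, candidate):
    """Dot-product test: close to n for a constituent, near 0 otherwise."""
    return bool(total @ candidate > len(candidate) / 2)


b_present = is_constituent(F, b)                # b · F = n + zero-mean noise
outsider_present = is_constituent(F, outsider)  # only zero-mean noise
```

Note that the test says nothing about order: any permutation of a, b, c, d and e produces
the same F.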

Since vector addition stores only unordered sets of vectors, relying on it is not sufficient
to encode information structure and order. We need a way to bind objects in a structure
together, while retaining a certain degree of flexibility when it comes to moving them around.
We want to bind words in a sentence together, while retaining the inherent meaning that
comes with them if we change the order of the words in certain ways[80].

For that, we use the binding operator, to bind a sum of vectors as part of a structure 
description, by multiplying the sum vector by a matrix. We use the features described 
previously [10] to identify chunks of data that maintain coherence even when shuffled, then 
bind them to a matrix, or sub-matrix, depending on their relevance in the general order of 
things. Component recognition is done the same way in vector binding as it was using the 
addition operator. 

By using multiple binding matrices, we can bind several batches, chunks, or objects
into a single vector. We can then use the preserved recognition feature to determine if a
certain element appears in any of them, if two elements appear together in the same matrix, or
if they appear in the same global structure, but in different chunks. This helps subsequent
learning and alleviates the ambiguity issues faced when using only vector addition.

In order to increase robustness, we can encode structure descriptions several times
in a vector, by adding them. Structures can be represented by sequentially binding their
components to matrices, in order to fix their precise order. This gives the network the
ability to not only generalize to a novel structure that is similar to the inputs, but also the
ability to know how similar certain structures are, or if they are exact replicas, element
and order-wise[16].
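
A minimal sketch of matrix binding follows. It assumes random binding matrices with entries
±1/sqrt(n), so that a bound vector keeps roughly the same length as the original; this
scaling, the role names and the helper functions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000  # illustrative dimension


def random_vector():
    return rng.choice([-1.0, 1.0], size=n)


def random_binding_matrix():
    # Scaled so that multiplying a ±1 vector keeps its squared length near n.
    return rng.choice([-1.0, 1.0], size=(n, n)) / np.sqrt(n)


# One binding matrix per role (hypothetical roles).
M_subject = random_binding_matrix()
M_object = random_binding_matrix()

dog, cat = random_vector(), random_vector()

# Bind each item to its role, then add the bound vectors into one structure.
S = M_subject @ dog + M_object @ cat


def bound_in(structure, matrix, item):
    """Same dot-product recognition as for addition, applied to the bound item."""
    return bool((matrix @ item) @ structure > n / 2)


dog_is_subject = bound_in(S, M_subject, dog)
dog_is_object = bound_in(S, M_object, dog)
```

Because each role has its own matrix, the same item bound to different roles yields nearly
orthogonal vectors, which is what removes the ambiguity of plain addition.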

7.3 Holographic Reduced Representations: 

A problem with representing concepts using connectionist networks is that items and
associations are represented in different spaces. In order for part-whole hierarchies, which
will be detailed later, to work as intended, we should employ a vector that acts as a reduced
description of a set of vectors, and combines their characteristics into one, for easier
manipulation. For this, and several other reasons stated earlier, Holographic Reduced
Representations, HRR for short, are used. They construct associations of vectors
in a reduced and compact fashion. The result of an association of two or more vectors of a
given dimension is a vector with the same dimension[62].

7.3.1 Matrix memory 

Traditional matrix memories are simple, and have high capacities[89]. They use three basic
operations for their storage needs. First, they use encoding, where two item vectors
are combined into a memory trace, or matrix. Secondly, they can use memory trace composition,
through addition, or superposition, of several memory traces, in order to compress more than
two matrices, or more than two vectors, into a single matrix. Thirdly, they use a decoding
process, where a memory trace and an item vector are used to give the other item vector. To
better illustrate these ideas, let us showcase some vector operations.

Basic vector operations. Let us consider:

V: Space of vectors representing items.

M: Space of matrices representing memory traces.

$\boxtimes$: Encoding operation, used to transform two vectors into a matrix:

$V \boxtimes V = M$

$\boxplus$: Trace composition operation, composes two memory traces into
one:

$M \boxplus M = M$

$\triangleright$: Decoding operation, used to retrieve a vector from a matrix,
using a cue vector:

$V \triangleright M = V$


Memory traces are very flexible. They can represent a number of associations and
compositions through pairs. Any item from any pair can be recovered, using the other vector
it was associated with as a cue.


Let a, b, c, d, e and f be vectors representing items.

The vectors a and b can be encoded into a matrix:

$a \boxtimes b = M$

Let us demonstrate memory trace composition:

$(a \boxtimes b) \boxplus (c \boxtimes d) \boxplus (e \boxtimes f) = C$

Using any vector as a decoding cue will recover the other vector of
the pair:

$e \triangleright C = f$


Noise increases as more vectors are stored in a single memory trace. This leads us towards
using autoassociative memory for noise and error filtering. However, the main issue with
using such a representation is the quadratic growth of storage as dimensionality increases. To
store N-dimensional vectors, $O(N^2)$ units will be needed to represent the N x N matrix.
This calls for new ways to represent vectors in memory cells using more compact schemes,
for better space efficiency.
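
The three operations of a matrix memory can be sketched with outer products. This is one
common realization, assumed here for illustration: encoding as the outer product, composition
as matrix addition, and decoding as a vector-matrix product followed by a sign threshold. The
±1 item vectors and the function names are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 512  # illustrative dimension

a, b, c, d, e, f = rng.choice([-1.0, 1.0], size=(6, n))


def encode(x, y):
    """Encoding: two item vectors give an n x n memory trace."""
    return np.outer(x, y)


def compose(*traces):
    """Trace composition: superpose several traces by addition."""
    return np.sum(traces, axis=0)


def decode(cue, trace):
    """Decoding: cue @ trace = (cue . x) * y + crosstalk; the sign removes the noise.
    To use the second item of a pair as the cue, decode against trace.T instead."""
    return np.sign(cue @ trace)


C = compose(encode(a, b), encode(c, d), encode(e, f))
recovered_f = decode(e, C)    # cue with e to recover f
recovered_e = decode(f, C.T)  # cue with f to recover e
```

The trace C holds n x n numbers regardless of how many pairs are stored, which is the
storage cost discussed above.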

7.3.2 Aperiodic convolution 

Past attempts used aperiodic convolution for associations[77]. Two vectors,
of N elements each, are convolved into a vector of $2N-1$ units. This is done by summing
the trans-diagonals of the outer product of the two vectors. This vector grows with recursive
convolution, as more vectors are added to it. For n vectors of N elements each being
aperiodically convolved, the resulting vector will be $[n(N-1)+1]$-dimensional, which
comes back to using $O(N^2)$ units when n is comparable to N. While suboptimal
compromises can be made, such as limiting the depth of the composition, or discarding
elements outside the central ones [14], the growth problem can be avoided entirely by using
circular convolution.
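
The growth of aperiodic convolution is easy to verify numerically. The sketch below uses
NumPy's `np.convolve`, whose default mode returns the full aperiodic convolution; the vector
length and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 5  # illustrative vector length
x, y, w = rng.standard_normal((3, N))

# Aperiodic convolution of two N-element vectors has 2N - 1 elements.
xy = np.convolve(x, y)

# The same result, obtained by summing the trans-diagonals of the outer product:
# element k collects every product x[i] * y[j] with i + j == k.
outer = np.outer(x, y)
diagonal_sums = np.zeros(2 * N - 1)
for i in range(N):
    for j in range(N):
        diagonal_sums[i + j] += outer[i, j]

# Convolving in a third vector grows the result to 3 * (N - 1) + 1 elements.
xyw = np.convolve(xy, w)
```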

7.3.3 Circular convolution 

The circular convolution[62] of two N-dimensional vectors is an N-dimensional
vector, and can be considered as a compressed outer product. The circular
convolution of n N-dimensional vectors is one N-dimensional vector, occupying
$O(N)$ units. A representation of circular convolution is displayed in Figure 3.


Circular correlation is the approximate inverse of circular convolution, and is able to
decode a memory trace using a cue vector, in order to give back the second item vector.
However, correlation only gives an approximate, noisy version of the original item vector,
which necessitates the use of autoassociative memory to filter out the noise. Circular
convolution can also be computed efficiently using Fast Fourier Transforms, for faster
vector rendering.

In all of these methods, to successfully store a vector, we only need to store enough
information to discriminate it from the other vectors. This property makes such compressions
possible, up to a certain limit. The capacity of the memory model is the number of
associations that can be represented usefully in a single memory trace.


Figure 3: Circular convolution represented as a compressed outer product. Image courtesy
of [61].

$z = x \circledast y$

$z_0 = x_0 y_0 + x_2 y_1 + x_1 y_2$
$z_1 = x_1 y_0 + x_0 y_1 + x_2 y_2$
$z_2 = x_2 y_0 + x_1 y_1 + x_0 y_2$
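
The three-element expansion of circular convolution, and decoding by circular correlation,
can be checked with a short sketch. It computes both operations through the FFT; the
function names, the dimension and the choice of normally distributed components with
variance 1/n are assumptions made for illustration.

```python
import numpy as np


def cconv(x, y):
    """Circular convolution: z[j] = sum_k x[k] * y[(j - k) mod n], via the FFT."""
    n = len(x)
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=n)


def ccorr(cue, trace):
    """Circular correlation: r[j] = sum_k cue[k] * trace[(k + j) mod n]."""
    n = len(cue)
    return np.fft.irfft(np.conj(np.fft.rfft(cue)) * np.fft.rfft(trace), n=n)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Three-element case: z0 = x0*y0 + x2*y1 + x1*y2, and so on.
z = cconv(np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0]))

# Encoding and approximate decoding of one association.
rng = np.random.default_rng(5)
n = 1024
a, b, c = rng.normal(0.0, 1.0 / np.sqrt(n), size=(3, n))
trace = cconv(a, b)
b_noisy = ccorr(a, trace)              # noisy reconstruction of b
match_with_b = cosine(b_noisy, b)      # clearly positive
match_with_c = cosine(b_noisy, c)      # close to zero
```

The decoded vector is only similar to b, not equal to it, which is why a clean-up step is
needed after correlation.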


Convolutions can also be used for variable binding, where the variable and the value are
bound together, or for representing sequences. Sequences can be represented in one memory
trace, or divided into several memory traces depending on their lengths and the desired
accuracy[53]. Decoding proceeds through chaining, where the cue vector at hand is
used to start a cascade of retrievals, correlating the trace with the current item, until the
end vector is reached[52]. In some cases, where elements are shared among different
structures, using only one past element of the sequence may not be enough. It is therefore
also possible to encode several past elements for more robust sequencing. To better
demonstrate these concepts, let us provide an example:


$\circledast$: Circular convolution operator.

$a \circledast b = V$

$\#$: Circular correlation operator.

$b \# V \approx a$

Sequences can be stored in several ways.

Non redundant storage:

$X_1 \circledast a + X_2 \circledast b + X_3 \circledast c = S$

Fully redundant storage, retrieval cue must be built from the ground
up using correlations.

$a + a \circledast b + a \circledast b \circledast c = S$

Semi-redundant storage, conserves a bit of redundancy, and gives
much more freedom for retrieval operations.

$X_1 \circledast a + a \circledast b + X_2 \circledast b + b \circledast c + X_3 \circledast c = S$

We can restore the memory c from S using $X_3$ as a cue:

$X_3 \# S \approx c$


Using correlation, and the relevant cue vectors, we can then retrieve our stored information 
from the memory network. The retrieved vectors will be noisy, so an autoassociative memory 
would go best with this storage method. 
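
Non redundant sequence storage and retrieval can be sketched as follows. The
autoassociative memory is replaced here by a simple nearest-neighbour clean-up over the
known item vectors, which is a stand-in chosen for brevity; the names, the dimension and the
random vectors are illustrative assumptions.

```python
import numpy as np


def cconv(x, y):
    """Circular convolution computed through the FFT."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=len(x))


def ccorr(cue, trace):
    """Circular correlation, the approximate inverse of cconv."""
    return np.fft.irfft(np.conj(np.fft.rfft(cue)) * np.fft.rfft(trace), n=len(cue))


rng = np.random.default_rng(6)
n = 1024
items = {name: rng.normal(0.0, 1.0 / np.sqrt(n), size=n) for name in "abc"}
X1, X2, X3 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(3, n))  # position cues

# Non redundant storage: each item is bound to its own position cue.
S = cconv(X1, items["a"]) + cconv(X2, items["b"]) + cconv(X3, items["c"])


def clean_up(noisy):
    """Stand-in for an autoassociative memory: return the best matching known item."""
    return max(items, key=lambda name: float(items[name] @ noisy))


third = clean_up(ccorr(X3, S))
first = clean_up(ccorr(X1, S))
```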

8 Working memory 

Working memory is a concept of human cognition, which aims at explaining our performance 
in tasks involving the manipulation of short-term information. The limitations of working 
memory have been thoroughly studied in psychology. Conclusive results found that memory 
was stored in chunks of information. The number of chunks that can be recalled at any one 
time was determined to be 7 ± 2 [49]. Further explorations found that the recall process 
depends on several other factors, including word length, syllables and prior familiarity 
with the words[34]. Training, however, did not seem to affect these chunks. The observed 
performance boosts were rather due to how chunks of information were manipulated. The 
biological cognitive terrain for working memory is a tripartite architecture, formed of three 
functionally complementary systems [4]. 

8.1 Biological architecture 

The pre-frontal cortex, posterior parietal cortex, and hippocampal system are the three main 
components of working memory. Let us explore them briefly. 

The pre-frontal cortex, PFC for short, specializes in the active maintenance of the
internal contextual information. It is dynamically and actively updated by the basal ganglia.
The basal ganglia, BG for short, are a group of subcortical gray matter nuclei, interconnected
with various structures in the brain, such as the cortex, thalamus and brain stem. The PFC-BG
system relies on dopamine based learning, and is able to bias on-going processing throughout
the cerebral cortex. This is done in order to maintain information that is currently relevant,
as well as fend off interference, and steer the attentional processes. This helps in performing
task relevant processes in the face of multicentric waves of knowledge, while reducing the
influence of other cortical regions which may not be relevant to the specific task at hand[48].

The posterior parietal cortex system performs automatic sensory and motor processing 
in the brain by relying on statistically accumulated knowledge. This system exhibits slow 
and integrative learning, through skills that are gained over long-term training. This long 
term processing power is believed to reside in the angular gyrus. It incorporates memory 
tasks involving extended periods of training and use, such as language processing, reading, 
writing and arithmetic operations[73]. 

The hippocampal system is a compound structure in the medial temporal lobe of the 
brain. It is composed of several structures, the most important of which are the hippocampus, 
and the entorhinal cortex. The entorhinal cortex is both the main input and output to the 
hippocampus. This system is specialized for rapid learning after short, or non-recurrent trials,
as well as spatial localization and navigation. It binds together arbitrary information, which
can be recalled by other neurological structures[42]. This structure also helps consolidate 
information from short-term memory, to long-term memory, through a process called Long 
Term Potentiation, LTP for short. 


8.2 Memory functions 

These regions support basic memory functions associated with working memory. Among
the most important ones are partial recalls of ongoing processes, as well as controlling
processing functions. They work as a central executive unit by supervising and controlling 
the flow of information into, and out of the system. The second of these emerges from the 
biasing influence by the pre-frontal cortex, which is actively maintained and updated by the 
basal ganglia, on the rest of the system. These two different functions are performed through 
the same neurological path, and distributed in a stable configuration throughout the cortex. 

We have previously showcased two different learning mechanisms, rapid learning and
integrative learning. These operations require different types of neurons, some being rapid
learners, and others integrative statistical learners. Researchers have attempted to model
these neural structures, including replicating their functions
and performance in basic memory recall tests such as the AX-CPT. Seeing as the PFC-BG
system is the most important of these in both short lag memory and controlling other systems,
we will be exploring some of its models in later sections.

There are several demands to creating a functional pre-frontal cortex and basal ganglia
model[23]. The system should be rapidly updating, and able to encode and maintain new
information as it occurs. It must also have the ability to selectively update information, by
knowing which elements to maintain in the face of interference, and which ones to update
from its previous iteration. The system should have the power to condition responses in other
parts of the network, by learning when to gate appropriately. As you may have noticed,
these are all points that have been discussed previously in their own context, through BPTT,
LSTM, NTM and PDP, but we will be seeing them combined here, in light of cognitive
memory processing.

But first, let us tackle the details of the PFC-BG operation, as well as its neurological
structures and connections, in order to better understand how one goes about implementing
such a system computationally.


8.3 Memory operations in the pre-frontal cortex and basal ganglia 

The main feature of neurons in the pre-frontal cortex is their rapid update cycles. This
occurs when Go neurons, in the dorsal striatum of the basal ganglia, fire. These Go units are
medium spiny GABAergic neurons, which inhibit the substantia nigra pars reticulata. This
leads to it releasing its tonic inhibition of the thalamus. The thalamus, being disinhibited,
enables a loop of excitation into the pre-frontal cortex, which toggles the state of the bistable
neurons contained within. The Go neurons are in direct competition with NoGo neurons.
The NoGo neurons are also present in the dorsal striatum, and counteract the effect of
Go neurons by inhibiting the external globus pallidus. This disinhibits the substantia nigra
pars reticulata, which promotes the inhibition of the thalamus, therefore protecting the
information stored, and contributing towards robustness in the face of interference[50]. The
PFC-BG architecture is visible in Figure 4.

The basal ganglia is connected with the pre-frontal cortex through a series of parallel
loops. There are multiple separate representations, working as inner and outer loops. These
loops give the PFC-BG complex the ability to process several layers of information in
parallel. An outer loop will have to be maintained, serving as the framework for the
inner loop’s conditional statements. The outer loop is stable and has a slow rate of variance,
whereas the inner loop goes through rapid iterations of changes. In order to better understand
this mechanism, and its potential characteristics, let us illustrate its application through the
use of a Continuous Performance Test, the AX-CPT:

The AX-CPT task:

Subjects are presented with a random sequence of flashing letter
stimuli:

A or B, followed by X or Y.

The test sequence is of the following nature:

A-Y-B-X-A-X-A-Y...etc

The prior stimulus should be maintained over a delay, until the
next stimulus is presented.

The target sequence is A-X. A dedicated button should
be pressed by the subject whenever they encounter it.

When any other sequence is detected, such as A-Y, B-X or
B-Y, another button should be pressed.

The first letter is therefore held in an outer loop, around
an inner loop which houses the second letter. This is required to
detect the target sequence.
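
The response rule of the AX-CPT can be written down directly. The sketch below only encodes
the task logic described above, with a single variable standing in for the maintained first
letter; it is not a model of the underlying neural mechanism, and the function name and
response labels are illustrative.

```python
def ax_cpt_responses(stimuli):
    """Return one response per probe (X or Y), given a stream of letters."""
    responses = []
    cue = None  # the first letter, held over the delay
    for stimulus in stimuli:
        if stimulus in ("A", "B"):
            cue = stimulus  # maintain the cue until the probe arrives
        elif stimulus in ("X", "Y"):
            is_target = cue == "A" and stimulus == "X"
            responses.append("target" if is_target else "non-target")
    return responses


sequence = "A-Y-B-X-A-X-A-Y".split("-")
responses = ax_cpt_responses(sequence)
```

For the example sequence, only the third pair, A-X, is a target.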

These results are the fruit of several deductive studies performed on humans, as well as
on behaving animals that exhibit human-like memory performance. Moreover, multiple
inner loops can co-exist in one outer loop. An inner loop could also play the role of an
outer loop, and house an inner loop of its own, resulting in a stack of inner loops. In order
to demonstrate that, let us present a more complicated and recent version of this test, the
1-2-AX-CPT.


The 1-2-AX-CPT task:

A 1 or 2 stimulus is added upstream of the regular A/B - X/Y
stimuli.

The test sequence therefore becomes:

1-A-Y-B-X-2-A-X-B-Y-A-Y...etc

The target sequence here is variable:

• If the subject last saw a 1: A-X is the target sequence.

• If the subject last saw a 2: B-Y is the target sequence.

The task defines an outer loop, 1 or 2, of active
maintenance, within which resides a stack of inner loops.

Each of these inner loops is formed by an outer loop, A or B,
and an inner loop, X or Y.
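
The nesting of loops can be made explicit in a short sketch of the task logic. Two variables
stand in for the outer loop (the last digit) and the inner loop (the last A or B); the
responses use R for the target button and L for the other button. The function name and the
choice to respond only on X or Y are assumptions made for illustration.

```python
# Target pair for each outer-loop context, as stated in the task description.
TARGETS = {"1": ("A", "X"), "2": ("B", "Y")}


def one_two_ax_responses(stimuli):
    """Return 'R' (target) or 'L' (any other sequence) for each X or Y probe."""
    outer = None  # outer loop: last digit seen, changes slowly
    inner = None  # inner loop: last A or B seen, changes quickly
    responses = []
    for stimulus in stimuli:
        if stimulus in TARGETS:
            outer = stimulus
            inner = None  # a new context starts a fresh inner loop
        elif stimulus in ("A", "B"):
            inner = stimulus
        elif stimulus in ("X", "Y"):
            hit = outer is not None and (inner, stimulus) == TARGETS[outer]
            responses.append("R" if hit else "L")
    return responses


sequence = "1-A-Y-B-X-2-A-X-B-Y-A-Y".split("-")
responses = one_two_ax_responses(sequence)
```

In the example sequence, A-X after the 2 is not a target, while B-Y is, which shows that the
inner comparison depends on the maintained outer context.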


A representation of the inner and outer loops is shown in Figure 5. In order to prevent
interference between different loops, this task is achieved through the unique architecture of
the PFC. This is done through isolated, self-connected stripes of interconnected neurons,
that are capable of sustained fire, and active maintenance of the working memory. Selective
updating of these stripes occurs through parallel loops that can be overwritten independently,
in different areas of the basal ganglia and the pre-frontal cortex[2].

Using a dopamine-based learning mechanism, the Go and NoGo neurons are trained to
appropriately gate the activity of other neurons in the striatum. Dopamine acts on these cells
using different types of receptors, D1 and D2, respectively for Go and NoGo neurons. The
D1-like family (D1 and D5) activates adenylyl cyclase, increasing intra-cellular cAMP levels
during positive reinforcement. The D2-like family (D2, D3, and D4) inhibits the formation
of cAMP during negative learning reinforcement[25]. Each neuron develops its own unique
pattern of weights. cAMP in turn regulates neuronal growth and development, and mediates
some behavioral responses. This dynamic gating mechanism helps switch between updating
existing information relative to incoming stimulus, and protecting information in the face of
interference.
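
The switch between updating and protecting can be sketched as a competition between a Go
signal and a NoGo signal for one stripe of working memory. In the sketch below the two
signal strengths are set by hand; in the system described above they would be learned
through dopamine. The class name, method name and numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stripe:
    """One independently gated working-memory slot."""
    content: Optional[str] = None

    def step(self, stimulus, go, nogo):
        """Update the content if Go wins the competition, otherwise maintain it."""
        if go > nogo:
            self.content = stimulus  # gate open: rapid update
        return self.content          # gate closed: content is protected


stripe = Stripe()
after_cue = stripe.step("A", go=0.9, nogo=0.1)              # relevant cue is stored
after_noise = stripe.step("distractor", go=0.2, nogo=0.8)   # interference is ignored
after_new_cue = stripe.step("B", go=0.7, nogo=0.3)          # a new cue overwrites the old one
```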


8.4 Modeling the PFC-BG system 

To encode these features in a computational system, we will need a basal ganglia model
with two types of neurons, Go and NoGo, controlling a pre-frontal cortex model. The
learning problem boils down to teaching these neurons when to fire, according to the sensory
input they receive from an equivalent to the posterior cortex, and the maintained pre-frontal
cortex activations.

However, once information is encoded into the pre-frontal cortex, we would not be
able to know how beneficial it was. Feedback on whether that memory was helpful or
inappropriate will only come when that memory is recalled, later in time. Knowing which
prior events were responsible for the subsequent good, or bad performance will be hard to
extrapolate, yet is critical for training our network. The pre-frontal cortex model will have
to be divided into stripes, containing independent loops. The network should then decide
what memories to encode in each stripe, reinforcing those that contribute to its success
during memory recall. This can be done through classical neural network algorithms, such
as backpropagation through time[86], although here, the problem is considerably more
complex because of the different types of nodes used, the latent feedback, and the unique
stripe-loop architecture.

Figure 4: [...] neurons in the dorsal striatum. STN represents the subthalamic nucleus. The 3 segments in the thalamus
are VA for ventral anterior nucleus, VL for ventral lateral nucleus, and MD for the dorso-medial nucleus.
GPe is the external segment of the globus pallidus. SNr is the substantia nigra pars reticulata. The thalamus
is bidirectionally excitatory with the frontal cortex. The SNr is tonically active, and inhibits the thalamus.
When Go neurons activate, they inhibit the SNr, and stop the inhibition of the thalamus. NoGo neurons
indirectly disinhibit the SNr, by inhibiting the GPe, which usually inhibits the SNr; this in turn strengthens
the inhibition of the thalamus. Image courtesy of [23].

The brain solves the latent feedback problem by predicting the subsequent rewards of
a certain stimulus [72]. Neurons connected to the basal ganglia, in the ventral tegmental area
and the substantia nigra pars compacta, predict the dopamine stimulus, thereby reinforcing
Go firing, when such maintenance usually leads to a positive reward. To model these
features, we can use the “Primary Value Learned Value”[57], or PVLV, Pavlovian learning
algorithm. In this algorithm, the Primary Value controls performance and learning during
only the primary rewards or unconditioned stimuli. On the other hand, the Learned Value
processes the received stimulus, and learns about the conditioned stimuli that are reliably
associated with rewards.

Figure 5: Inner and outer loops represented through the 1-2-AX-CPT task. The stimuli are presented in
a sequence, and the participants respond by pressing one of two buttons. The right button, R, is pressed when
the subject is presented with the target sequence. The left button, L, is pressed when they are presented with
any other sequence. Image courtesy of [23].
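
The idea of learning to predict a reward from a stimulus can be illustrated with a simple
delta rule. This is explicitly not the PVLV algorithm: it is a much simpler, swapped-in
prediction-error update of the Rescorla-Wagner kind, shown only to make the notion of a
learned reward prediction concrete. The function name, learning rate and trial count are
assumptions.

```python
def train_reward_prediction(rewards, learning_rate=0.2):
    """Learn a scalar reward prediction for one stimulus from repeated trials.

    Returns the final prediction and the prediction error of every trial.
    """
    prediction = 0.0
    errors = []
    for reward in rewards:
        error = reward - prediction          # surprise: received minus predicted
        prediction += learning_rate * error  # move the prediction towards the reward
        errors.append(error)
    return prediction, errors


# Thirty trials in which the stimulus is always followed by a reward of 1.
prediction, errors = train_reward_prediction([1.0] * 30)
```

The error is large on the first trial and shrinks as the stimulus comes to predict the
reward, so the learning signal moves from the reward itself to the stimulus that announces
it.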

The unique stripe-loop architecture can be explicitly observed in the pre-frontal cortex
during working memory recall tests. This indicates that for the brain, as well as the models
that should be used to model it, information that must be updated at different points in time
must also be represented in different parts of the pre-frontal cortex[38]. Anterior areas of the
pre-frontal cortex are selectively activated for outer-loop information, while the dorsolateral
parts are active for inner-loop information. The model used should therefore employ parallel
loops that are updated independently by external structures. A dynamic, context sensitive
gating mechanism should be used to update, or protect, relevant information in the face of
respectively relevant, or irrelevant inputs.


9 Modeling the cortical architecture 

9.1 Multiple layers of representation 

In order to make the fine distinctions required to control behavior and other types of
processing, the cerebral cortex is composed of 6 layers, each of which has a distinct role.
The cortex needs an efficient way of adapting the synaptic weights across these multiple
layers of feature-detecting neurons. BPTT is a possible model of how neural networks could
learn multiple layers of representations; however, it is only suited to supervised learning
on labeled training data.

For this reason, a neural network that contains both bottom-up recognition connections
and top-down generative connections is needed. Bottom-up connections are used to recognize
the input data, whereas top-down generative connections fulfill their roles by generating
a whole distribution of data vectors over recurrent passes. In other words, bottom-up
connections help in determining the activations of the features that are present, layer
by layer. Top-down connections try to generate the training data using the activations
learned from the bottom-up pass. The learning algorithm used for the top-down connections
is an inverted learning model using negative feedback loops[56]. These passes are done
while adjusting the weights on the connections in order to maximize performance.

This model is in line with several neuroscientific observations, including single cell
recordings, and the theory of reciprocal connectivity between cortical areas, which points
to a hierarchical progression of complexity in the cortex.

9.2 Training unlabeled data 

The external data we receive as humans are random. Most of the time, no labeled
data are presented to us, and we aren’t able to manually structure our brains to perform a
certain task. In terms of modeling, the goal translates to finding hidden structures in
unlabeled data. This problem boils down to probabilistic density estimation. A child holds
an apple, and hears it being referred to as “Apple” several times, making the connection
that this object is indeed called an “Apple”. A probability distribution is inferred over
the various possible settings of the hidden variables, by doing several recurrent passes[27].
However, a neural network with only one hidden layer is too simplistic, and would not
be able to handle such rich and complex structures, nor would it be able to correctly deal
with the high-dimensional operations at hand. One way to solve this problem is by stacking
restricted Boltzmann machines, in order to form a deep belief net. However, this is not
optimal. The hidden variables use discrete values, making exact inference possible, but very
limited to the domain and learned scenarios, causing reduced generalization capacities.

Going down the probabilistic road, and simplifying our way to the lowest common
denominator, our problem boils down to using either generative models or discriminative
models. Generative models are full probabilistic models. They can be used to generate the
probabilistic prediction of any, and all variables in the model. They express complex, and
complete, relationships between the observed and target variables, and are able to categorize
data statistically, to determine the most likely input that would generate such a signal. The
other type of probabilistic modeling, discriminative models, only model the target variables
conditionally, in view of the observed data-sets. This is a limiting probabilistic distribution,
and does not give the operator insight into how the data is being generated. Its role is to
simply categorize a signal.

For a long time, the general consensus was that generative models were inferior to
their discriminative counterparts[36], however that turned out to be false. Generative models,
while they may take more time to resolve, have superior computation powers, due to their
transparency towards the full probabilistic distribution. A generative model can infer a
discriminative model from its data, whereas the opposite is not true.
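
This asymmetry follows from the product rule of probability. Writing x for the observed
data and y for the target variable (symbols introduced here for illustration), a generative
model of the joint distribution yields the discriminative conditional directly:

```latex
p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}
```

The conditional alone, on the other hand, does not determine $p(x)$, so the joint
distribution $p(x, y) = p(y \mid x)\, p(x)$ cannot be recovered from it.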

Modeling the mind should therefore rely on unsupervised learning, using many hidden
layers in a non-linear distributive neural network. Networks build a progressively more
complex hierarchy using sensory data, by passing it through a generative model. In order to
solve the problem of reciprocal connectivity through top-down and bottom-up connections,
good candidates are PCSPs. We will talk briefly about them, before moving onto a
larger concept which makes use of them.

9.3 Parallel constraint satisfaction processes 

Parallel constraint satisfaction processes[65], PCSP for short, are a type of network used to
model cognitive dissonance, and other contemporary issues in social, and Gestalt psychology.
Activations pass around symmetrically connected nodes, until the activation of all the nodes
relaxes into a stable state. By relaxing, we mean that the percentage change between the
previous and current state of each node asymptotes towards zero, at which point all
constraints are met. This final state satisfies all the constraints set among the nodes, which
are connected with each other, and represent hypotheses detailing the presence or absence of
certain features. These connections serve as a ground to establish hypotheses as being either
consistent with each other, or inconsistent and contradictory.

Activations spread in parallel among the nodes. Nodes with positive links, or connections,
will activate each other reciprocally. Nodes with negative links inhibit each other.
This results in a global and viable solution to the constraints among the entire set of nodes.
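
The relaxation process can be sketched on a three-node network. The update rule below, a
damped step towards the hyperbolic tangent of each node's net input, is one simple choice
among many and is an assumption of this sketch, as are the weights, the external evidence
and the stopping tolerance.

```python
import numpy as np


def relax(weights, external, rate=0.2, tolerance=1e-6, max_steps=1000):
    """Spread activation in parallel until no node changes by more than tolerance."""
    activations = np.zeros(len(external))
    for step in range(1, max_steps + 1):
        net_input = weights @ activations + external
        updated = (1 - rate) * activations + rate * np.tanh(net_input)
        if np.max(np.abs(updated - activations)) < tolerance:
            return updated, step, True
        activations = updated
    return activations, max_steps, False


# Symmetric links: nodes 0 and 1 support each other, node 2 contradicts both.
weights = np.array([[0.0, 1.0, -1.0],
                    [1.0, 0.0, -1.0],
                    [-1.0, -1.0, 0.0]])
external = np.array([0.5, 0.0, 0.0])  # some evidence in favour of hypothesis 0

activations, steps, converged = relax(weights, external)
```

After relaxation the two mutually consistent hypotheses are active and the contradictory
one is suppressed, even though only the first received any direct evidence.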

9.4 Part-whole hierarchies 

Representing simple structures in a distributive representation network is complex. Objects
are not symbols, or singular entities, they are instead represented as patterns across many
units. It is even harder to represent complex structures, such as sentences, where the
meaning is composed of several constituents, with relations interweaving across several
of its sub-elements. Attentional focus must therefore be brought on the constituents of that
structure, while at the same time maintaining the whole meaning in check. To deal with that
problem, one potential solution is using part-whole hierarchies, where objects at one level
are composed of inter-related objects at the next level down.

In the classical digital computer, hierarchical data structures are composed of a set of
fields, or variables, containing pointers to the content of the fields, their values. Addresses
are a small representation of the symbol they point to, and many symbols can be put together
to create a fully-articulated representation. This simple representation comes at the cost of
being limited by the Von Neumann bottleneck[5], in which all data must pass through the
narrow channel between processor and memory.

A famous account of the disagreement between connectionists and advocates of symbolic
architectures is that of Fodor and Pylyshyn[15]. They believed that the mind operates around
rule-governed formulations, and the manipulation of sentences in an inner linguistic code.
Connectionism was seen as a step backwards, devoid of any real computation, resembling
associationism, where an input is associated with its output through trial and error. However,
as more progress was made in both human cognition and connectionism, many of
their points did not hold up. Neural networks can generalize to scenarios beyond the cases
they were trained on, create an internal representation to express regularities in the domain,
and use isomorphisms in order to encode structures more efficiently. This indicates that
distributed representations have much more power than previously thought.

In a PDP network, patterns also allow remote access to fuller representations. Most
processing can be done using parallel constraint satisfaction processes, in which activation
passes around symmetrically connected nodes, until the states satisfy the constraints set
among the nodes.

PCSP results in an inflexible inner loop, leading to two different ways of performing
inference[65].

9.4.1 Intuitive inferences, and rational inferences 

Inferences are performed efficiently by allowing the neural network to settle down into a
stable state. This allows the network to reach a conclusive result. Inferences are influenced
by the network's weightings, past knowledge, and connection strength. Some operations
require a single settling of the network to resolve; such cases are referred to as intuitive
inferences.

Other operations require a more serial approach, where several intuitive inferences
are performed in sequence; these are known as rational inferences. Rational inferences involve
sequences of changes in the way in which parts of the task are mapped onto the network. It
is important to note that a single task can sometimes be performed in different ways,
requiring either intuitive inference or rational inference, depending on the degree of
complexity of its iterations.

9.4.2 Internal network structure 

In order to build a part-whole hierarchy, it is possible to map out the internal representation 
of the network by hand. The algorithm then sorts the inputs, applies the necessary steps, and 
derives conclusions using our base framework. It is also possible to give the network the
freedom to use its experience of a set of propositions to construct its own internal
representation of concepts. Propositions are presented in a neutral way, and the network translates the
input into active units, using its own associations. Similar patterns of activity will therefore 
represent similar relations, and the representation will be built around correlations between 
the given input. This leads to better generalization due to the network searching for features 
that make it simple to express the regularities of the domain. 

Instead of using a separate bit for each possible object that could be stored in the
memory, each bit is shared between many objects. A bit is applied to one position of the
object; the combination of its identity and position within the object activates a role-specific
unit that contributes to the recognition of the object. The recognition model is shared across
the entities within one level. This model is called within-level timesharing[45].

In order to share knowledge in the connections between different levels of the part-whole
hierarchy, it is necessary to implement a flexible mapping of pieces of the task into a
module of the network. This is done by choosing a node as the current whole; the other units
are devoted to describing either its global properties or its major constituents.
The pattern of activity that represents the same object differs depending on the whole, or
context, in which it is presented[28]. This implementation readily violates the assumption that
each entity in the world has a unique corresponding representation in the network, which
was an essential part of classical symbolic architectures. One system that hits many of these
points, and provides a good biologically plausible architecture for computing, is the HTM.


9.5 Hierarchical Temporal Memory 

Hierarchical Temporal Memory[22], HTM for short, is a bio-inspired machine learning
model that makes use of distributed representations to store information. Its main goal is
modeling the neocortex, using nodes that are arranged in columns, layers and regions, and in a
hierarchy.

It is composed of two separate stages, a spatial pooler and a temporal pooler. The spatial
pooler creates a sparse binary representation from the network's input. The temporal pooler
makes predictions, which are not provided in the input, in response to the vector sequences.
The predictions can range from next-step predictions to predictions several steps into the
future.
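
As a rough illustration of how a sparse binary representation can be produced from an input, the sketch below uses a simplified k-winners-take-all scheme. It is an assumption-laden toy, not the algorithm of [22]: the function names, the random fixed wiring and the absence of any learning or boosting are all choices made here for brevity.

```python
import random


def make_columns(n_columns, n_inputs, fan_in, seed=0):
    """Give each column a fixed random set of input bits it listens to."""
    rng = random.Random(seed)
    return [rng.sample(range(n_inputs), fan_in) for _ in range(n_columns)]


def spatial_pool(input_bits, columns, n_active):
    """Return a binary vector with exactly n_active columns switched on."""
    # Overlap: how many of a column's inputs are currently active.
    overlaps = [sum(input_bits[i] for i in col) for col in columns]
    # Keep the columns with the highest overlap (ties broken by index).
    ranked = sorted(range(len(columns)), key=lambda c: (-overlaps[c], c))
    sdr = [0] * len(columns)
    for c in ranked[:n_active]:
        sdr[c] = 1
    return sdr


input_bits = [1 if i % 3 == 0 else 0 for i in range(64)]
columns = make_columns(n_columns=128, n_inputs=64, fan_in=16)
sdr = spatial_pool(input_bits, columns, n_active=5)
```

Only 5 of the 128 output bits are active here, so the output stays sparse regardless of how dense the input is.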

The model represents dendrites using non-linear segments, and mimics them using a
sum and threshold function. These structures connect with other neurons, or nodes, forming
an equivalent of a synapse, leading a node to fire when enough activations have been received
by any one segment, through an OR operator. The segments and their weights can be learned,
and changed adaptively with the received inputs. Their number for a given node, as well as
their total number in the structure, can change with time to accommodate learning.
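
The sum-and-threshold segments combined through an OR can be sketched as follows. The data layout (a segment as a mapping from source node to synaptic weight) and the threshold value are hypothetical choices for illustration, and learning of segments is left out.

```python
def segment_active(segment, active_inputs, threshold):
    """Sum the weights of synapses whose source is active, then threshold."""
    total = sum(w for src, w in segment.items() if src in active_inputs)
    return total >= threshold


def node_fires(segments, active_inputs, threshold):
    """A node fires if ANY one of its segments crosses the threshold (OR)."""
    return any(segment_active(s, active_inputs, threshold) for s in segments)


# Two segments on one node, each mapping source node -> synaptic weight.
segments = [
    {0: 1.0, 1: 1.0, 2: 1.0},
    {5: 1.0, 6: 1.0},
]

fires_on_first = node_fires(segments, {0, 1}, threshold=2.0)   # True
fires_on_second = node_fires(segments, {5, 6}, threshold=2.0)  # True
fires_on_split = node_fires(segments, {0, 5}, threshold=2.0)   # False
```

The last case shows the non-linearity: two active inputs spread over different segments do not add up, whereas two on the same segment do.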

HTMs are trained on large sets of patterns and sequences, where information is stored 
in a distributed fashion, following what was described previously. The network has the
freedom to choose how and where the information is stored, making for better generalization.
HTMs have VSA features[58], and are able to achieve bundling, by storing several objects 
in one vector. They are also able to achieve variable binding, where a role can be used to 
obtain the values of the filler. 


10 Constructing the Code: A recap 

Most of the work that has been carried out in the field of artificial neural networks, and 
presented here, has targeted very specific features. From language processing, to visual 
imagery, and models of memory, none of these models encompass the entirety of the mind. 
Using the limited knowledge presented in this review, we will attempt to make our own 
model of a general brain architecture. As we have previously seen, the brain is limited by
several factors; working memory is one example. In the future we may want to
exceed these limits, but for now, to ensure a one-to-one scale, we will work within these boundaries.

Symbolic and localist architectures are not flexible enough, and would not be able to 
deal with the large amounts of data that our brains are presented with on a daily basis. For 
this reason, our model will be using a distributed representation. Entities will be represented 
by patterns, and spread across many nodes. Variable binding is an essential part of human 
cognition, but it used to be a challenge to deal with in distributed connectionist models. 
However, vector symbolic architectures have proved that they can handle variable binding
to a fairly decent degree. In order to encode enough features, and have enough robustness
and plasticity in our networks, we will opt to use high-dimensional vectors. The
high-dimensional vectors used should all be of the same dimension, to enable us to apply the wide
array of pre-existing and thoroughly tested mathematical tools to them. Our brains
are interconnected, with constant signal propagation. We will therefore model them using a
recurrent system of artificial neural networks. 

Although there is a large number of neuron types in the brain, most of them are distinguished
by anatomical, morphological and chemical differences. The two types of neurons
that absolutely have to be accounted for are the classical specificity neurons and mixed
selectivity neurons. Classical specificity neurons are simple, and integrate activity coming
from different regions in a linear way, in order to activate or inhibit a certain structure. They
respond specifically to one stimulus or task. Unfortunately, there is currently not enough
data to determine exactly where the classical neurons reside in the brain, or exactly what
their advantages over mixed selectivity neurons are. For this reason, and for the sake of
simplification, we will assume that classical neurons are used exclusively for the input and
output of data. They will be modeled using linear nodes, and will help transfer data into
and out of the neural architecture.

Mixed selectivity neurons are overwhelmingly present in higher-order structures. They
have complex, non-linear activation patterns. These are the neurons responsible for most
of the logical computation done in our brain. Therefore, they will be modeled as hidden
nodes with non-linear activation functions. Our input and output layers will use classical
selectivity neurons, represented by linear nodes. Our hidden layers will use mixed
selectivity neurons, represented by non-linear nodes. Nodes will use a sum and threshold
function, through a leaky integrate-and-fire model. They will be connected to each other
by links, playing the role of synapses. The number of links will be variable, and subject to
learning. Connections can be created or dismantled after long-term training.
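
A leaky integrate-and-fire node of the kind mentioned above can be sketched in a few lines. This is a discrete-time toy under stated assumptions: the leak factor, threshold and reset value are arbitrary illustrative parameters, and the input is taken to be the already summed synaptic drive at each time step.

```python
def simulate_lif(inputs, leak=0.9, threshold=1.0, reset=0.0):
    """Discrete-time leaky integrate-and-fire node.

    At each step the membrane potential decays by `leak`, the summed
    input is added, and the node fires and resets once the potential
    reaches `threshold`. Returns a list of 0/1 spikes, one per step.
    """
    potential = reset
    spikes = []
    for x in inputs:
        potential = leak * potential + x
        if potential >= threshold:
            spikes.append(1)
            potential = reset
        else:
            spikes.append(0)
    return spikes


strong = simulate_lif([0.4] * 6)   # fires on every third step
weak = simulate_lif([0.05] * 50)   # leak wins, never reaches threshold
```

With a constant drive of 0.4 the potential reaches 0.4, 0.76 and then 1.084, so the node fires on every third step; with a drive of 0.05 the potential saturates near 0.5 and the node stays silent.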

Inserting gating nodes seems an essential part of any neural network. We do not want
oscillating weights. We do not want any gradients to vanish, or explode. And most
importantly, we do not want the network to be heavily influenced by recent inputs and
modifications. However, we doubt there are dedicated gating neurons in the brain. As a result,
all nodes, whether they are gates, memory cells, or regular neurons, will have to be identical. It
is therefore necessary to use flexible mappings between the entities. Each node will play the
role of the whole, and all the units connecting to it will be used to gate information into and
out of it. Each node will have its own set of representations, and the representation of a certain
object will differ between two nodes. Using these characteristics, it is theoretically
possible to have each node behave differently in response to the signal it is receiving. A
node will know whether it is being used as a gate node, a computing node, or both at the same
time in response to different clusters of nodes, based on the input pattern it has received
and on its previous activations. However, how this will be achieved mathematically
remains to be seen.

The cortex is divided into separate areas, and the concept that each area controls a
more or less specific function has been tested several times, and is agreed upon. In all
humans, the visual cortex is located in Brodmann areas 17, 18 and 19, while the auditory
cortex is located in Brodmann areas 41 and 42. However, we believe that this is not due to a
special internal structure of the brain. It is simply due to how exterior stimuli are fed into
the brain along their specific paths. Although not discussed here, these areas can change if
the signal is rerouted from one structure to another[59].

But one characteristic that is constant, for a seemingly unwarranted reason, is the distinction
between the cortex and the central gray nuclei (CGNs). CGNs are an interconnected
set of neurons that compile activations from different areas of the brain, perform their
computations, then reroute them to other structures. It is hard to believe that such unique,
yet consistent elements would arise through training. We are led to believe that there is an
internal predisposition to such a structure, and we should therefore enforce it in our model.

The cerebral cortex has a laminar structure, and is composed of 6 layers forming
microcircuits. These microcircuits are grouped into cortical minicolumns. Each layer
has special characteristic connections with the others. This can be seen rather consistently
throughout the cortex. For this purpose, we will be stacking unrestricted Boltzmann machines
to model this architecture. Unrestricted Boltzmann machines are identical to RBMs, with
the exception of the presence of connections between hidden units. Central gray nuclei will
be modeled using fully interconnected, or highly interconnected, recurrent neural networks.





The connections should be flexible and differentiable end to end. The network will be 
trained through unsupervised learning, using unlabeled data, in ways comparable to how 
the brain would receive its data. Because there are generally no restrictions on the network, 
the network will construct its own internal representation. Similar entities will end up with 
similar vectors. 

In order to effectively deal with the extremely large amounts of data that a model such
as the brain is bombarded with on a daily basis, we will be using Holographic Reduced
Representations to conserve memory space. A variable and its value will be bound together
through circular convolution, in order to form one vector. Vectors can be compressed
together to sequence them, or indicate similar entities. This will also allow us to maintain
the sequential recall order for complex tasks. In order to save time, and be able to increment
steps at a higher rate, we will be using fast and approximate operations. This includes
computing the circular convolution of HRRs through the FFT rather than directly, and decoding
with an approximate inverse. Because approximate decoding returns noisy vectors, a clean-up
memory is required, so an autoassociative storage method will be employed.
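
The binding, decoding and clean-up steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's implementation: the vector dimension, the item names, the hand-written radix-2 FFT and the use of cosine similarity as the clean-up criterion are all choices made here. In this sketch the FFT route reproduces direct circular convolution up to rounding error; the approximate step is the decoding with the involution, which is why a clean-up memory is needed.

```python
import cmath
import math
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two, unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def bind(a, b):
    """Circular convolution via the convolution theorem."""
    n = len(a)
    prod = [p * q for p, q in zip(fft(a), fft(b))]
    return [v.real / n for v in fft(prod, inverse=True)]


def bind_direct(a, b):
    """Circular convolution from its definition, for comparison."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def involution(a):
    """Approximate inverse of a vector: index i maps to -i modulo n."""
    n = len(a)
    return [a[-i % n] for i in range(n)]


def unbind(trace, role):
    """Noisy estimate of the filler that was bound to `role`."""
    return bind(trace, involution(role))


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


def clean_up(noisy, memory):
    """Return the name of the stored item most similar to the noisy vector."""
    return max(memory, key=lambda name: cosine(noisy, memory[name]))


def random_vector(n, rng):
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


rng = random.Random(0)
n = 512
memory = {name: random_vector(n, rng)
          for name in ("red", "blue", "circle", "square")}
color, shape = random_vector(n, rng), random_vector(n, rng)

# Bundle two role-filler pairs into a single vector of the same dimension.
trace = [x + y for x, y in zip(bind(color, memory["red"]),
                               bind(shape, memory["circle"]))]

decoded = clean_up(unbind(trace, color), memory)
max_diff = max(abs(p - q) for p, q in zip(bind(color, memory["red"]),
                                          bind_direct(color, memory["red"])))
```

Querying the bundled trace with the `color` role returns a noisy vector whose nearest stored item is the filler that was bound to it.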

As we have seen throughout this review, tremendous ground has been covered, and astonishing
work has been done so far in the field of computational neuroscience. How far we will
be able to go in the following years remains to be seen. The path to the future is still very 
much obscure, but with the recent surge of interest in the concept of mind modeling, and 
the combined efforts of researchers from several subspecialities, this goal will definitely be 
within our reach in the near future. 





References 


[1] : Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski. "A learning algorithm
for Boltzmann machines." Cognitive science 9.1 (1985): 147-169.

[2] : Alexander, Garrett E., Mahlon R. DeLong, and Peter L. Strick. "Parallel organization
of functionally segregated circuits linking basal ganglia and cortex." Annual review of
neuroscience 9.1 (1986): 357-381.

[3] : Asaad, Wael F., Gregor Rainer, and Earl K. Miller. "Neural activity in the primate
prefrontal cortex during associative learning." Neuron 21.6 (1998): 1399-1407.

[4] : Atallah, Hisham E., Michael J. Frank, and Randall C. O'Reilly. "Hippocampus, cortex,
and basal ganglia: Insights from computational models of complementary learning systems."
Neurobiology of learning and memory 82.3 (2004): 253-267.

[5] : Backus, J. "Can programming be liberated from the von Neumann style." Comm. ACM
21 (1978): 899.

[6] : Bellman, Richard, et al. Adaptive control processes: a guided tour. Vol. 4. Princeton:
Princeton university press, 1961.

[7] : Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies
with gradient descent is difficult." Neural Networks, IEEE Transactions on 5.2 (1994):
157-166.

[8] : Bengio, Yoshua, and Samy Bengio. "Modeling High-Dimensional Discrete Data with
Multi-Layer Neural Networks." NIPS. Vol. 99. 1999.

[9] : Caid, William R., Susan T. Dumais, and Stephen I. Gallant. "Learned vector-space
models for document retrieval." Information processing & management 31.3 (1995):
419-429.

[10] : Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The
Journal of Machine Learning Research 12 (2011): 2493-2537.

[11] : Das, Sreerupa, C. Lee Giles, and Guo-Zheng Sun. "Learning context-free grammars:
Capabilities and limitations of a recurrent neural network with an external stack memory."
Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana
University. 1992.

[12] : Desjardins, Guillaume, et al. "Tempered Markov chain Monte Carlo for training of
restricted Boltzmann machines." International Conference on Artificial Intelligence and
Statistics. 2010.

[13] : Donoho, David L. "High-dimensional data analysis: The curses and blessings of
dimensionality." AMS Math Challenges Lecture (2000): 1-32.

[14] : Eich, Janet M. "A composite holographic associative recall model." Psychological
Review 89.6 (1982): 627.





[15] : Fodor, Jerry A., and Zenon W. Pylyshyn. "Connectionism and cognitive architecture:
A critical analysis." Cognition 28.1 (1988): 3-71.

[16] : Gallant, Stephen I., and T. Wendy Okaywe. "Representing objects, relations, and
sequences." Neural computation 25.8 (2013): 2038-2078.

[17] : Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual
prediction with LSTM." Neural computation 12.10 (2000): 2451-2471.

[18] : Gers, Felix, and Jürgen Schmidhuber. "Recurrent nets that time and count." Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint
Conference on. Vol. 3. IEEE, 2000.

[19] : Giles, C. Lee, et al. "Learning and extracting finite state automata with second-order
recurrent neural networks." Neural Computation 4.3 (1992): 393-405.

[20] : Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv
preprint arXiv:1410.5401 (2014).

[21] : Hadley, Robert F. "The problem of rapid variable creation." Neural computation 21.2
(2009): 510-532.

[22] : Hawkins, Jeff, Subutai Ahmad, and Donna Dubinsky. "Hierarchical temporal memory
including HTM cortical learning algorithms." Technical report, Numenta, Inc, Palo
Alto (2010).

[23] : Hazy, Thomas E., Michael J. Frank, and Randall C. O'Reilly. "Banishing the homunculus:
making working memory work." Neuroscience 139.1 (2006): 105-118.

[24] : Hebb, Donald Olding. The organization of behavior: A neuropsychological theory.
Psychology Press, 2005.

[25] : Hernández-López, Salvador, et al. "D1 receptor activation enhances evoked discharge in
neostriatal medium spiny neurons by modulating an L-type Ca2+ conductance." The Journal
of neuroscience 17.9 (1997): 3334-3342.

[26] : Hinton, Geoffrey E. "Learning distributed representations of concepts." Proceedings of
the eighth annual conference of the cognitive science society. Vol. 1. 1986.

[27] : Hinton, Geoffrey E. "Learning multiple layers of representation." Trends in cognitive
sciences 11.10 (2007): 428-434.

[28] : Hinton, Geoffrey E. "Mapping part-whole hierarchies into connectionist networks."
Artificial Intelligence 46.1 (1990): 47-75.

[29] : Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural
computation 9.8 (1997): 1735-1780.





[30] : Hodgkin, Alan L., and Andrew F. Huxley. "A quantitative description of membrane
current and its application to conduction and excitation in nerve." The Journal of physiology
117.4 (1952): 500-544.

[31] : Holzer, Adrian, Patrick Eugster, and Benoît Garbinato. "Evaluating implementation
strategies for location-based multicast addressing." Mobile Computing, IEEE Transactions
on 12.5 (2013): 855-867.

[32] : Hopfield, John J. "Neural networks and physical systems with emergent collective
computational abilities." Proceedings of the national academy of sciences 79.8 (1982):
2554-2558.

[33] : Hsu, Yuan-Yih, and Chien-Chuen Yang. "A hybrid artificial neural network-dynamic
programming approach for feeder capacitor scheduling." Power Systems, IEEE Transactions
on 9.2 (1994): 1069-1075.

[34] : Hulme, Charles, et al. "The role of long-term memory mechanisms in memory span."
British Journal of Psychology 86 (1995): 527.

[35] : Ingster, Yuri I., Christophe Pouet, and Alexandre B. Tsybakov. "Classification of sparse
high-dimensional vectors." Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences 367.1906 (2009): 4427-4448.

[36] : Jordan, A. "On discriminative vs. generative classifiers: A comparison of logistic
regression and naive bayes." Advances in neural information processing systems 14 (2002):
841.

[37] : Kanerva, Pentti. "Hyperdimensional computing: An introduction to computing in
distributed representation with high-dimensional random vectors." Cognitive Computation
1.2 (2009): 139-159.

[38] : Koechlin, Etienne, et al. "Dissociating the role of the medial and lateral anterior
prefrontal cortex in human planning." Proceedings of the National Academy of Sciences
97.13 (2000): 7651-7656.

[39] : Kohonen, Teuvo. "Correlation matrix memories." Computers, IEEE Transactions on
100.4 (1972): 353-359.

[40] : Kosko, Bart. "Bidirectional associative memories." Systems, Man and Cybernetics,
IEEE Transactions on 18.1 (1988): 49-60.

[41] : Levitt, Jonathan B., et al. "Topography of pyramidal neuron intrinsic connections in
macaque monkey prefrontal cortex (areas 9 and 46)." Journal of Comparative Neurology
338.3 (1993): 360-376.

[42] : Maass, Anne, et al. "Laminar activity in the hippocampus and entorhinal cortex related
to novelty and episodic encoding." Nature communications 5 (2014).





[43] : Malsburg, Cvd. "The correlation theory of brain function." Internal Report MPI für
biophysikalische Chemie 81 (1981).

[44] : Marcus, G. F. "The algebraic mind: Reflections on connectionism and cognitive science."
(2000). Cambridge, MA: MIT Press.

[45] : McClelland, James L., and David E. Rumelhart. "An interactive activation model of
context effects in letter perception: I. An account of basic findings." Psychological review
88.5 (1981): 375.

[46] : McClelland, James L., David E. Rumelhart, and Geoffrey E. Hinton. "The appeal of
parallel distributed processing." Cambridge, MA: MIT Press, 1986.

[47] : McCulloch, Warren S., and Walter Pitts. "A logical calculus of the ideas immanent in
nervous activity." The bulletin of mathematical biophysics 5.4 (1943): 115-133.

[48] : McNab, Fiona, and Torkel Klingberg. "Prefrontal cortex and basal ganglia control
access to working memory." Nature neuroscience 11.1 (2008): 103-107.

[49] : Miller, George A. "The magical number seven, plus or minus two: some limits on our
capacity for processing information." Psychological review 63.2 (1956): 81.

[50] : Mink, Jonathan W. "The basal ganglia: focused selection and inhibition of competing
motor programs." Progress in neurobiology 50.4 (1996): 381-425.

[51] : Monner, Derek, and James A. Reggia. "A generalized LSTM-like training algorithm
for second-order recurrent neural networks." Neural Networks 25 (2012): 70-83.

[52] : Murdock, Bennet B. "A distributed memory model for serial-order information."
Psychological Review 90.4 (1983): 316.

[53] : Murdock, Bennet B. "Serial-order effects in a distributed-memory model." (1987). In
David S. Gorfein and Robert R. Hoffman, editors, Memory and Learning: The Ebbinghaus
Centennial Conference, pages 277-310. Lawrence Erlbaum Associates.

[54] : Neumann, Jane. "Learning the systematic transformation of holographic reduced
representations." Cognitive Systems Research 3.2 (2002): 227-235.

[55] : Newell, Allen. "Physical Symbol Systems." Cognitive science 4.2 (1980): 135-183.

[56] : Oh, Jong-Hoon, and H. Sebastian Seung. "Learning Generative Models with the Up
Propagation Algorithm." Advances in Neural Information Processing Systems. 1998.

[57] : O'Reilly, Randall C., et al. "PVLV: the primary value and learned value Pavlovian
learning algorithm." Behavioral neuroscience 121.1 (2007): 31.

[58] : Padilla, Daniel E., and Mark D. McDonnell. "A neurobiologically plausible vector
symbolic architecture." Semantic Computing (ICSC), 2014 IEEE International Conference
on. IEEE, 2014.





[59] : Pallas, Sarah L., Anna W. Roe, and Mriganka Sur. "Visual projections induced into
the auditory pathway of ferrets. I. Novel inputs to primary auditory cortex (AI) from the
LP/pulvinar complex and the topography of the MGN-AI projection." Journal of Comparative
Neurology 298.1 (1990): 50-68.

[60] : Pineda, Fernando J. "Dynamics and architecture for neural computation." Journal of
Complexity 4.3 (1988): 216-245.

[61] : Plate, Tony. "Estimating analogical similarity by dot-products of Holographic Reduced
Representations." Advances in neural information processing systems (1994): 1109-1109.

[62] : Plate, Tony. "Holographic Reduced Representations: Convolution Algebra for
Compositional Distributed Representations." IJCAI. 1991.

[63] : Pollack, Jordan B. "On connectionist models of natural language processing."
Computing Research Laboratory, New Mexico State University, 1987.

[64] : Pollack, Jordan B. "Recursive distributed representations." Artificial Intelligence 46.1
(1990): 77-105.

[65] : Read, Stephen J., Eric J. Vanman, and Lynn C. Miller. "Connectionism, parallel
constraint satisfaction processes, and gestalt principles: (Re)introducing cognitive dynamics
to social psychology." Personality and Social Psychology Review 1.1 (1997): 26-53.

[66] : Rigotti, Mattia, et al. "Internal representation of task rules by recurrent dynamics: the
importance of the diversity of neural responses." Frontiers in computational neuroscience 4
(2010).

[67] : Rigotti, Mattia, et al. "The importance of mixed selectivity in complex cognitive tasks."
Nature 497.7451 (2013): 585-590.

[68] : Robinson, A. J., and Frank Fallside. The utility driven dynamic error propagation
network. University of Cambridge Department of Engineering, 1987.

[69] : Rosenfeld, Ronald, and David S. Touretzky. "Coarse-coded symbol memories and
their properties." Complex Systems 2.4 (1988): 463-484.

[70] : Roweis, Sam. "Boltzmann machines." lecture notes (1995).

[71] : Rudolph-Lilith, Michelle, and Lyle E. Muller. "Aspects of randomness in biological
neural graph structures." BMC Neuroscience 14.Suppl 1 (2013): P284.

[72] : Schultz, Wolfram. "Predictive reward signal of dopamine neurons." Journal of
neurophysiology 80.1 (1998): 1-27.

[73] : Seghier, Mohamed L. "The angular gyrus multiple functions and multiple subdivisions."
The Neuroscientist 19.1 (2013): 43-61.

[74] : Siegelmann, Hava T., and Eduardo D. Sontag. "On the computational power of neural
nets." Journal of computer and system sciences 50.1 (1995): 132-150.





[75] : Smolensky, Paul. "On the proper treatment of connectionism." Behavioral and brain
sciences 11.01 (1988): 1-23.

[76] : Smolensky, Paul. "Parallel distributed processing: explorations in the microstructure
of cognition, vol. 1. chapter Information processing in dynamical systems: foundations of
harmony theory." MIT Press, Cambridge, MA, USA 15 (1986): 18.

[77] : Smolensky, Paul. "Tensor product variable binding and the representation of symbolic
structures in connectionist systems." Artificial intelligence 46.1 (1990): 159-216.

[78] : Sowell, Elizabeth R., et al. "Mapping cortical change across the human life span."
Nature neuroscience 6.3 (2003): 309-315.

[79] : Touretzky, David S., and Geoffrey E. Hinton. "A distributed connectionist production
system." Cognitive Science 12.3 (1988): 423-466.

[80] : Treisman, Anne. "Solutions to the binding problem: progress through controversy and
convergence." Neuron 24.1 (1999): 105-125.

[81] : Turing, Alan Mathison. "On computable numbers, with an application to the
Entscheidungsproblem." J. of Math 58.345-363 (1936): 5.

[82] : Turing, Alan Mathison. "On computable numbers, with an application to the
Entscheidungsproblem. A correction." Proceedings of the London Mathematical Society 2.1 (1938):
544-546.

[83] : Verleysen, Michel. "Learning high-dimensional data." Nato Science Series Sub Series
III Computer And Systems Sciences 186 (2003): 141-162.

[84] : Von Neumann, John. "First Draft of a Report on the EDVAC." IEEE Annals of the
History of Computing 4 (1993): 27-75.

[85] : Wagemans, Johan, et al. "A century of Gestalt psychology in visual perception: II.
Conceptual and theoretical foundations." Psychological bulletin 138.6 (2012): 1218.

[86] : Werbos, Paul J. "Generalization of backpropagation with application to a recurrent gas
market model." Neural Networks 1.4 (1988): 339-356.

[87] : Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." arXiv
preprint arXiv:1410.3916 (2014).

[88] : Williams, Ronald J., and David Zipser. "A learning algorithm for continually running
fully recurrent neural networks." Neural computation 1.2 (1989): 270-280.

[89] : Willshaw, David. "Holography, associative memory, and inductive generalization."
(1985).

[90] : Wolpert, D. "A computationally universal field computer which is purely linear."
Relatório Técnico. LA-UR-91-2937, Los Alamos National Laboratory, 1991.





[91] : Yarom, Yosef, and Jorn Hounsgaard. "Voltage fluctuations in neurons: signal or
noise?" Physiological Reviews 91.3 (2011): 917-929.





